[[breakout]]

Differences

This shows you the differences between two versions of the page.

breakout [2018/12/27 09:30]
beckmanf [Running long jobs]
breakout [2018/12/27 09:40]
beckmanf [Running long jobs]
Line 162:
 When you login to the breakout via your RZ account, then your home directory is mounted on the breakout from the RZ file server via nfs. When you logout from the breakout, then your home directory is unmounted after 5 minutes if you have no job still running. If you have a job running, e.g. via tmux or a job in the background then your home directory remains mounted.

-If you leave a job running for more than about 10 hours you get errors when you try to access files in your home directory. The reason is that the mounting process requires an authentification which is done via the kerberos service. When you login to the breakout with your password, then you automagically receive a kerberos ticket which is derived from the login credentials. This is required by the automounter of your home directory - without a kerberos ticket the nfs server does not allow the access to your files. When I run the pytorch example [[#Running the imagenet training]], then this takes about 5 days. After approximately 10 hours runtime I receive the following bus error message
+If you leave a job running for more than about 10 hours you get errors when you try to access files in your home directory. The reason is that the mounting process requires an authentication which is done via the kerberos service. When you login to the breakout with your password, then you automagically receive a kerberos ticket which is derived from the login credentials. This is required by the automounter of your home directory - without a kerberos ticket the nfs server does not allow the access to your files. When I run the pytorch example [[#Running the imagenet training]], then this takes about 5 days. After approximately 10 hours runtime I receive the following bus error message

 <code>
Line 199:
 </code>

-The kerberos ticket lifetime 10h and the renew time is 24h. So after 18:28:43 you cannot access your home directory anymore. You can apply for a new ticket with longer lifetime and a longer renew time with "kinit".
+The kerberos ticket lifetime is 10h and the renew time is 24h. So after 18:28:43 you cannot access your home directory anymore. You can apply for a new ticket with longer lifetime and a longer renew time with "kinit".

 <code>
Line 206:
 </code>

-This applies for ticket lifetime of 2 days and a renew time of 7 days. You can check the result with klist again.
+In the example above you apply for ticket lifetime of 2 days and a renew time of 7 days. You can check the result with klist again.

 <code>
 beckmanf@breakout:~$ klist
-Ticket cache: FILE:/tmp/krb5cc_12487
+Ticket cache: FILE:/tmp/krb5cc_12487_ssddef
 Default principal: beckmanf@RZ.HS-AUGSBURG.DE

Line 217:
  renew until 03.01.2019 08:30:05
 </code>
+
+The kerberos ticket lifetime is still only 10h, but the renew time is now seven days.
+
+== Renew a kerberos ticket ==
+
+To get a new kerberos ticket you have to provide your password. You can, however, renew an existing ticket and extend its lifetime without a password until the maximum renew time expires. The ticket must still be valid (not expired) when you start the renewal. In the example above you would have to renew before 18:30:09. You renew with "kinit -R".
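
The manual renewal described above could be repeated in a loop while a job is running. The following is only a minimal sketch, not part of the original page: the function name `renew_until_failure` and the variables `RENEW_CMD` and `INTERVAL` are hypothetical names introduced for illustration, and the one hour interval is an assumption chosen to stay well below the 10h ticket lifetime.

```shell
#!/bin/sh
# Sketch: renew the kerberos ticket periodically without a password.
# RENEW_CMD and INTERVAL are hypothetical settings, not part of any tool.
RENEW_CMD="${RENEW_CMD:-kinit -R}"   # "kinit -R" renews the existing ticket
INTERVAL="${INTERVAL:-3600}"         # seconds to wait between two renewals

# Run the renew command, then sleep, and repeat until the renew command
# fails, e.g. because the maximum renew time has been reached.
renew_until_failure() {
    count=0
    while $RENEW_CMD; do
        count=$((count + 1))
        sleep "$INTERVAL"
    done
    echo "renewal stopped after $count successful renewals"
}
```

After loading these definitions you would call `renew_until_failure &` in the same tmux session before you start the long job. Once the renew time has expired, you need a fresh ticket via "kinit" with your password.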
+
+== Start a job with automatic kerberos ticket renew ==
+
+You can automate the ticket renewal. When you start a job with "krenew", your existing kerberos ticket is copied to a new ticket cache location and renewed automatically until the renew time expires or the job finishes. For the pytorch imagenet training example, this looks like this:
+
+<code>
+krenew python -- main.py --gpu=2 -a resnet18 /fast/imagenet
+</code>
+
+If you do this inside a tmux session, you can then detach and log out. The job will run for up to seven days.
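
Since krenew needs a valid ticket to start from, it can help to check for one before launching a job that runs for days. The wrapper below is a sketch under assumptions and not taken from the original page: `start_with_renew`, `KLIST_CMD` and `KRENEW_CMD` are hypothetical names, and the check assumes the MIT Kerberos `klist`, where `-s` prints nothing and only sets the exit status.

```shell
#!/bin/sh
# Sketch: start a job under krenew only if a valid ticket exists.
# KLIST_CMD and KRENEW_CMD are hypothetical override hooks for testing.
start_with_renew() {
    # "klist -s" exits non-zero when the ticket cache cannot be read
    # or the ticket has expired.
    if ! ${KLIST_CMD:-klist} -s; then
        echo "no valid kerberos ticket, run kinit first" >&2
        return 1
    fi
    # "--" ends krenew's own options, so the flags of the job are
    # passed through to the job unchanged.
    ${KRENEW_CMD:-krenew} -- "$@"
}
```

A call would then look like `start_with_renew python main.py --gpu=2 -a resnet18 /fast/imagenet`. Note that this form places `--` directly after `krenew`, which differs from the command shown on the page.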
 ==== PyTorch ====
  
  • breakout.txt
  • Last modified: 2022/03/26 17:38
  • by beckmanf