[[breakout]]

Differences

This shows you the differences between two versions of the page.

breakout [2018/12/26 08:57]
beckmanf [PyTorch]
breakout [2020/12/23 12:55] (current)
beckmanf added Deskproto breakout install
Line 11: Line 11:
   * Intel X540-T2 10GB Base-T Ethernet network adapter   * Intel X540-T2 10GB Base-T Ethernet network adapter
   * 4 x NVIDIA Geforce GTX 1080 with GP104 Pascal, 2560 cores, 8 GB RAM   * 4 x NVIDIA Geforce GTX 1080 with GP104 Pascal, 2560 cores, 8 GB RAM
-  * Debian Linux Jessie, NVIDIA Cuda, Torch +  * Debian Linux Jessie 
 -  * NVidia driver 410.78 +  * NVidia driver 450.80.02 
-  * Kernel ​3.16.51-3 +  * Kernel ​4.9.0-13 
-  * Cuda 10+  * Cuda 10, Cuda 8 
 +  * Tensorflow, Torch 
 +  * Docker 19.03.13, Nvidia-docker
  
 ===== Usage notes ===== ===== Usage notes =====
Line 134: Line 136:
 </​code>​ </​code>​
  
 -can be checked. In the example above you can see that there are +can be checked. In the example above you can see:
  
 -  * four GeForce GTX 1080 graphics cards+  * There are four GeForce GTX 1080 graphics cards
    * Graphics card "3" is currently in use: the fan is running at 90% and the temperature is 76 degrees C   * Graphics card "3" is currently in use: the fan is running at 90% and the temperature is 76 degrees C
    * The process with process ID 14538 "python" is running on card 3. Its memory is almost full at 7323 MiB.   * The process with process ID 14538 "python" is running on card 3. Its memory is almost full at 7323 MiB.
 +
 +==== Running long jobs ====
 +
 +=== tmux - Keep a session running even when you logout ===
 +
 +With tmux you can keep a session running even after you log out. You can log in again later and the session is still there. Create a new session:
 +
 +<​code>​
 +tmux new-session -s fredo
 +</​code>​
 +
 +Now you can start a program. To leave the tmux session (and the program) running, type CTRL-b d. This detaches you from the tmux session. You can then log out of your ssh session and everything keeps running on the breakout. When you log in to breakout via ssh again, you can reattach to tmux with
 +
 +<​code>​
 +tmux attach-session -t fredo
 +</​code>​
 +
 +You should see the output from your running program.
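
The attach/detach steps above can also be scripted. The following is a minimal sketch, not part of the original setup: the function name ''start_detached'' and the ''TMUX_CMD'' override are hypothetical and exist only for illustration and testing. It assumes that ''tmux has-session'' and ''tmux new-session -d'' behave as documented in the tmux manual.

```shell
# Sketch: start a long-running command in a detached tmux session.
# TMUX_CMD is a hypothetical override (not a tmux feature) so the
# function can be exercised without a real tmux binary.
start_detached() {
  session="$1"   # tmux session name, e.g. fredo
  job_cmd="$2"   # command line to run inside the session
  tmux_cmd="${TMUX_CMD:-tmux}"

  # has-session exits with status 0 if the session already exists.
  if "$tmux_cmd" has-session -t "$session" 2>/dev/null; then
    echo "tmux session '$session' already exists" >&2
    return 1
  fi

  # -d creates the session without attaching to it.
  "$tmux_cmd" new-session -d -s "$session" "$job_cmd"
}
```

A call such as ''start_detached fredo "python main.py"'' would start the job without attaching. You can look at it later with ''tmux attach-session -t fredo''.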
 +
 +=== kerberos - keep your file system alive ===
 +
 +When you log in to the breakout via your RZ account, your home directory is mounted on the breakout from the RZ file server via nfs. When you log out from the breakout, your home directory is unmounted after 5 minutes if you have no job still running. If you have a job running, e.g. via tmux or a job in the background, your home directory remains mounted.
 +
 +If you leave a job running for more than about 10 hours, you get errors when you try to access files in your home directory. The reason is that mounting requires authentication, which is done via the kerberos service. When you log in to the breakout with your password, you automatically receive a kerberos ticket which is derived from the login credentials. The automounter of your home directory requires this ticket: without a kerberos ticket the nfs server does not allow access to your files. For example, the pytorch example [[#Running the imagenet training]] takes about 5 days. After approximately 10 hours of runtime I receive the following bus error message
 +
 +<​code>​
 +Epoch: [12][4980/​5005] ​ Time 0.523 (0.524) ​     Data 0.000 (0.034) ​     Loss 2.5527 (2.5143) ​   Acc@1 44.922 (44.781) ​  Acc@5 69.922 (69.733)
 +Epoch: [12][4990/​5005] ​ Time 0.525 (0.524) ​     Data 0.000 (0.034) ​     Loss 2.7477 (2.5144) ​   Acc@1 44.141 (44.778) ​  Acc@5 66.016 (69.732)
 +Epoch: [12][5000/​5005] ​ Time 0.520 (0.524) ​     Data 0.000 (0.034) ​     Loss 2.3334 (2.5144) ​   Acc@1 46.094 (44.776) ​  Acc@5 70.312 (69.730)
 +Test: [0/​196] ​  Time 3.587 (3.587) ​     Loss 1.6937 (1.6937) ​   Acc@1 58.203 (58.203) ​  Acc@5 86.328 (86.328)
 +Test: [10/​196] ​ Time 0.159 (0.814) ​     Loss 2.3972 (2.0702) ​   Acc@1 39.062 (51.598) ​  Acc@5 75.391 (77.131)
 +...
 +Test: [170/196] Time 2.123 (0.635) ​     Loss 1.9238 (2.3964) ​   Acc@1 46.094 (45.463) ​  Acc@5 81.641 (72.149)
 +Test: [180/196] Time 0.159 (0.630) ​     Loss 2.1114 (2.4070) ​   Acc@1 44.531 (45.254) ​  Acc@5 78.125 (71.996)
 +Test: [190/196] Time 1.742 (0.633) ​     Loss 1.7933 (2.3935) ​   Acc@1 53.516 (45.492) ​  Acc@5 87.891 (72.215)
 + * Acc@1 45.864 Acc@5 72.442
 +Traceback (most recent call last):
 +  File "​main.py",​ line 398, in <​module>​
 +  File "​main.py",​ line 113, in main
 +...
 +  File "/​rz2home/​beckmanf/​miniconda3/​lib/​python3.7/​site-packages/​torch/​serialization.py",​ line 141, in _with_file_like
 +PermissionError:​ [Errno 13] Permission denied: '​checkpoint.pth.tar'​
 +Bus-Zugriffsfehler
 +beckmanf@breakout:​~/​pytorch/​examples/​imagenet$ ​
 +</​code>​
 +
 +The reason for this bus error is that the pytorch program tries to write the file "checkpoint.pth.tar" to the home directory, but the home directory can no longer be accessed because the kerberos ticket has expired.
 +
 +You can check the status of your current kerberos ticket with "​klist"​.
 +
 +<​code>​
 +beckmanf@breakout:​~$ klist
 +Ticket cache: FILE:/​tmp/​krb5cc_12487_ssddef
 +Default principal: beckmanf@RZ.HS-AUGSBURG.DE
 +
 +Valid starting ​      ​Expires ​             Service principal
 +27.12.2018 08:​28:​43 ​ 27.12.2018 18:​28:​43 ​ krbtgt/​RZ.HS-AUGSBURG.DE@RZ.HS-AUGSBURG.DE
 + renew until 28.12.2018 08:28:37
 +</​code>​
 +
 +The kerberos ticket lifetime is 10h and the renew time is 24h. So after 18:28:43 you cannot access your home directory anymore. You can apply for a new ticket with a longer lifetime and a longer renew time with "kinit".
 +
 +<​code>​
 +beckmanf@breakout:​~$ kinit -l 2d -r 7d
 +Password for beckmanf@RZ.HS-AUGSBURG.DE: ​
 +</​code>​
 +
 +In the example above you apply for a ticket lifetime of 2 days and a renew time of 7 days. You can check the result with klist again.
 +
 +<​code>​
 +beckmanf@breakout:​~$ klist
 +Ticket cache: FILE:/​tmp/​krb5cc_12487_ssddef
 +Default principal: beckmanf@RZ.HS-AUGSBURG.DE
 +
 +Valid starting ​      ​Expires ​             Service principal
 +27.12.2018 08:​30:​09 ​ 27.12.2018 18:​30:​09 ​ krbtgt/​RZ.HS-AUGSBURG.DE@RZ.HS-AUGSBURG.DE
 + renew until 03.01.2019 08:30:05
 +</​code>​
 +
 +The kerberos ticket lifetime is still only 10h but the renew time is now seven days.
 +
 +== Renew a kerberos ticket ==
 +
 +To get a new kerberos ticket you have to provide your password. But you can renew your ticket and extend its lifetime without a password until the maximum renew time expires. You must have a valid, non-expired ticket when you start the renewal. In the example above you would have to renew before 18:30:09. You can renew with "kinit -R".
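
To get a feeling for the numbers, the renewal arithmetic can be sketched as follows. This is only an illustration under stated assumptions: each renewal is assumed to give a fresh ticket with the same lifetime (10h in the example), and the helper names ''renewals_needed'' and ''fits_renew_window'' are hypothetical. The result is a lower bound, because in practice you renew some time before the ticket expires.

```shell
# Sketch: how many renewals does a job need at least?
# Assumption: every renewal yields a new ticket with the same lifetime.
renewals_needed() {
  job_hours="$1"        # expected job runtime in hours
  lifetime_hours="$2"   # ticket lifetime in hours
  # ceil(job_hours / lifetime_hours) tickets are needed in total;
  # the first one comes from kinit, the rest are renewals.
  echo $(( (job_hours + lifetime_hours - 1) / lifetime_hours - 1 ))
}

# Sketch: does the job finish before the renew time runs out?
fits_renew_window() {
  job_hours="$1"        # expected job runtime in hours
  renew_hours="$2"      # renew time in hours (7d = 168h)
  [ "$job_hours" -le "$renew_hours" ]
}
```

For the 5 day (120h) imagenet training with a 10h ticket lifetime, ''renewals_needed 120 10'' gives 11. The job fits into a 7 day renew time (168h) but not into the default 24h.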
 +
 +== Start a job with automatic kerberos ticket renew ==
 +
 +You can do the ticket renewal automatically. When you start a job with "krenew", your existing kerberos ticket is copied to a new ticket cache location and the ticket is renewed automatically until the renew time expires or the job is done. The ticket cache is copied because the kerberos cache that you received at login (here: /tmp/krb5cc_12487_ssddef) will be deleted at logout. For the pytorch imagenet training example, this looks like:
 +
 +<​code>​
 +krenew python -- main.py --gpu=2 -a resnet18 /​fast/​imagenet
 +</​code>​
 +
 +If you do this inside a tmux session, you can detach and log out. The job will run for up to seven days. When you log in later you can check the status of the job's kerberos ticket again with klist. You have to provide the filename of the job's ticket cache.
 +
 +<​code>​
 +klist /​tmp/​krb5cc_12487_ftXjk0
 +</​code>​
 +
 +In my example the new cache name from krenew was /​tmp/​krb5cc_12487_ftXjk0. ​
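
If you did not note the cache name, you can search for it. The following sketch assumes that the ticket caches are files named krb5cc_<uid>_<suffix> in /tmp, as in the examples above; the function name ''list_ticket_caches'' is hypothetical.

```shell
# Sketch: list kerberos ticket cache files that belong to a user.
# Assumption: caches are named krb5cc_<uid>_<suffix>, as in the
# examples on this page.
list_ticket_caches() {
  dir="${1:-/tmp}"        # directory to search
  uid="${2:-$(id -u)}"    # numeric user id, defaults to the caller
  for f in "$dir"/krb5cc_"$uid"_*; do
    # The pattern stays unexpanded if nothing matches, so check existence.
    if [ -e "$f" ]; then
      echo "$f"
    fi
  done
}
```

Each file name that is printed can then be passed to klist as shown above.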
 +
 +== Login via Public Key Authentication ==
 +
 +When you log in via Public Key Authentication, you do not receive a new kerberos ticket. If you do not have a valid kerberos ticket, then "$HOME/.ssh/authorized_keys" cannot be accessed, so you fall back to the default password login and receive a new kerberos ticket. If you did log in via public key, "klist" will not show any kerberos ticket, because the ticket that keeps your home directory accessible belongs to some other login session. However, you can still run "kinit" and receive a new kerberos ticket. It will be stored in the default kerberos ticket cache location at "/tmp/krb5cc_<uid>".
 +==== PyTorch ====
 +
 +I installed [[http://​pytorch.org|PyTorch]] via miniconda in my home directory. Anaconda/​Miniconda is an installation method for python tools. The installation of miniconda is described [[https://​conda.io/​docs/​user-guide/​install/​linux.html|here]]. I used the 64 Bit version for python 3.7. The download is [[https://​conda.io/​miniconda.html|here]]. So I did:
 +
 +<​code>​
 +cd
 +wget https://​repo.continuum.io/​miniconda/​Miniconda3-latest-Linux-x86_64.sh
 +bash Miniconda3-latest-Linux-x86_64.sh
 +conda update conda
 +</​code>​
 +
 +The conda files are installed in your home directory under $HOME/​miniconda3. You have to add the path to the conda binaries to your PATH variable by adding this section
 +
 +<​code>​
 +if [ -d "​$HOME/​miniconda3"​ ]; then
 +  export PATH=$HOME/​miniconda3/​bin:​$PATH
 +fi
 +</​code>​
 +
 +to your .profile file in your home directory. Then you have to log out and log in again. Now the conda program should be available. Check with:
 +
 +<​code>​
 +beckmanf@breakout:​~$ which conda
 +/​rz2home/​beckmanf/​miniconda3/​bin/​conda
 +</​code>​
 +
 +Now you can update the conda installations with:
 +
 +<​code>​
 +conda update conda
 +</​code>​
 +
 +The [[http://​pytorch.org|installation of PyTorch]] is done via 
 +
 +<​code>​
 +conda install pytorch torchvision -c pytorch
 +</​code>​
 +
 +=== Running the CIFAR-10 tutorial via jupyter notebook ===
 +
 +I did the [[http://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html|CIFAR-10 classifier tutorial]] via a [[http://jupyter.org|jupyter notebook]]. Jupyter notebook is a web frontend that lets you execute the python code from a web browser. To install jupyter, I ran
 +
 +<​code>​
 +conda install notebook
 +</​code>​
 +
 +Then I created a working directory and started the notebook server there:
 +
 +<​code>​
 +cd
 +mkdir -p pytorch/​cifar10
 +cd pytorch/​cifar10
 +beckmanf@breakout:​~/​pytorch/​cifar10$ jupyter notebook --no-browser
 +[I 11:​59:​55.306 NotebookApp] The port 8888 is already in use, trying another port.
 +[I 11:​59:​55.405 NotebookApp] Serving notebooks from local directory: /​rz2home/​beckmanf/​pytorch/​cifar10
 +[I 11:​59:​55.405 NotebookApp] 0 active kernels
 +[I 11:​59:​55.405 NotebookApp] The Jupyter Notebook is running at:
 +[I 11:​59:​55.405 NotebookApp] http://​localhost:​8889/?​token=3d22f49d309a3e4fc0834dd58e3f7f36152d34e7a318aa3a
 +[I 11:​59:​55.405 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
 +[C 11:​59:​55.405 NotebookApp] ​
 +    ​
 +    Copy/paste this URL into your browser when you connect for the first time,
 +    to login with a token:
 +        http://​localhost:​8889/?​token=3d22f49d309a3e4fc0834dd58e3f7f36152d34e7a318aa3a
 +</​code>​
 +
 +In this example the jupyter web server is at port number 8889 on the breakout. The breakout is configured such that this port can NOT be reached from outside. Therefore you have to tunnel this port via ssh to your client machine. So do the following on your client with your account name.
 +
 +<​code>​
 +FriedrichsMacBook:​~ fritz$ ssh -p 2222 -L 8889:​localhost:​8889 beckmanf@breakout.hs-augsburg.de
 +</​code>​
 +
 +Now you can open the jupyter notebook in a local web browser on your client machine. Use the URL that was printed above, including the token.
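
Because jupyter may choose a different port on every start, it can help to derive the tunnel command from the printed URL. This is a minimal sketch; the helper names ''jupyter_port'' and ''tunnel_command'' are hypothetical, and the ssh port 2222 and host name are taken from the example above.

```shell
# Sketch: extract the port from a jupyter URL such as
# http://localhost:8889/?token=...
jupyter_port() {
  echo "$1" | sed -n 's|^http://localhost:\([0-9]*\)/.*|\1|p'
}

# Sketch: print the ssh command that forwards this port to the client.
tunnel_command() {
  user="$1"   # your account name on the breakout
  port="$2"   # port of the jupyter server on the breakout
  echo "ssh -p 2222 -L ${port}:localhost:${port} ${user}@breakout.hs-augsburg.de"
}
```

For the URL shown above, ''jupyter_port'' prints 8889, and ''tunnel_command beckmanf 8889'' prints the ssh command from the example.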
 +
 +=== Running the imagenet training ===
 +
 +The [[http://​image-net.org/​challenges/​LSVRC/​2012/​index|imagenet-12 dataset]] is a set of 1.3 million images which are hand labeled and categorized in 1000 categories. The data is available on the breakout at /​fast/​imagenet. The training is done with the pytorch examples. Install the pytorch examples from the git repository:
 +
 +<​code>​
 +cd
 +cd pytorch
 +git clone https://​github.com/​pytorch/​examples.git
 +cd examples
 +cd imagenet
 +</​code>​
 +
 +Now you can run the pytorch imagenet training with
 +
 +<​code>​
 +python main.py --gpu=2 -a resnet18 /​fast/​imagenet
 +</​code>​
 +
 +The training takes about 5 days on the breakout. Refer to [[#Running long jobs]] to see how you can run such long jobs on the breakout.
 +
 +==== Civil engineers - Photoscan ====
 +
 +The photoscan software is installed under /​opt/​photoscan-pro. To run the software via the graphical user interface start the gui session via vncserver as described above. Then open a terminal and start photoscan via:
 +
 +=== Start the Software ===
 +
 +<​code>​
 +vglrun /​opt/​photoscan-pro/​photoscan.sh
 +</​code>​
 +
 +=== License Activation ===
 +The software is currently installed with root as owner. Therefore only root can update the software and the license. To update the license, do:
 +
 +<​code>​
 +sudo /​opt/​photoscan-pro/​photoscan.sh --activate EGKKS-KRNPU-LRMLE-RJDTS-GE4SK
 +</​code>​
  
 ==== Torch ==== ==== Torch ====
Line 250: Line 465:
 docker rm digits docker rm digits
 </​code>​ </​code>​
 +
  
 ==== Tensorflow ==== ==== Tensorflow ====
Line 293: Line 509:
 Once you reconnected to the server, you are ready to use python3 with TensorFlow. Once you reconnected to the server, you are ready to use python3 with TensorFlow.
  
-==== PyTorch ​====+==== Deskproto ​====
  
 -I installed [[http://pytorch.org|PyTorch]] via miniconda in my home directory. Anaconda/Miniconda is an installation method for python tools. The installation of miniconda is described [[https://conda.io/docs/user-guide/install/linux.html|here]]. I used the 64 Bit version for python 3.7. The download is [[https://conda.io/miniconda.html|here]]. So I did:+The Deskproto CAM software is installed and can be started with a GUI:
  
 <​code>​ <​code>​
-cd +vglrun -display ​:0.3 /opt/deskproto/DeskProto_7.0_de_Linux_20200909-x86_64_Rev9761.AppImage ​
-wget https://repo.continuum.io/miniconda/​Miniconda3-latest-Linux-x86_64.sh +
-bash Miniconda3-latest-Linux-x86_64.sh +
-conda update conda+
 </​code>​ </​code>​
  
-The conda files are installed ​in your home directory under $HOME/​miniconda3. You have to add the path to the conda binaries to your PATH variable by adding this section+The display option ​in the example above will result in running on GPU 3. 
  
-<​code>​ 
-if [ -d "​$HOME/​miniconda3"​ ]; then 
-  export PATH=$HOME/​miniconda3/​bin:​$PATH 
-fi 
-</​code>​ 
- 
 -to your .profile file in your home directory. Then you have to log out and log in again. Now the conda program should be available. Check with: 
- 
-<​code>​ 
-beckmanf@breakout:​~$ which conda 
-/​rz2home/​beckmanf/​miniconda3/​bin/​conda 
-</​code>​ 
- 
-Now you can update the conda installations with: 
- 
-<​code>​ 
-conda update conda 
-</​code>​ 
- 
-The [[http://​pytorch.org|installation of PyTorch]] is done via  
- 
-<​code>​ 
-conda install pytorch torchvision -c pytorch 
-</​code>​ 
- 
 -=== Running the CIFAR-10 tutorial via jupyter notebook === 
- 
-I did the [[http://​pytorch.org/​tutorials/​beginner/​blitz/​cifar10_tutorial.html|CIFAR-10 classifier tutorial]] via a [[http://​jupyter.org|jupyter notebook]]. Jupyter notebook is a webfrontend such that 
-the python code can be executed via a webbrowser. To install the jupyter framework I installed 
- 
-<​code>​ 
-conda install notebook 
-</​code>​ 
- 
-<​code>​ 
-cd 
-mkdir -p pytorch/​cifar10 
-cd pytorch/​cifar10 
-beckmanf@breakout:​~/​pytorch/​cifar10$ jupyter notebook --no-browser 
-[I 11:​59:​55.306 NotebookApp] The port 8888 is already in use, trying another port. 
-[I 11:​59:​55.405 NotebookApp] Serving notebooks from local directory: /​rz2home/​beckmanf/​pytorch/​cifar10 
-[I 11:​59:​55.405 NotebookApp] 0 active kernels 
-[I 11:​59:​55.405 NotebookApp] The Jupyter Notebook is running at: 
-[I 11:​59:​55.405 NotebookApp] http://​localhost:​8889/?​token=3d22f49d309a3e4fc0834dd58e3f7f36152d34e7a318aa3a 
-[I 11:​59:​55.405 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). 
-[C 11:​59:​55.405 NotebookApp] ​ 
-    ​ 
-    Copy/paste this URL into your browser when you connect for the first time, 
-    to login with a token: 
-        http://​localhost:​8889/?​token=3d22f49d309a3e4fc0834dd58e3f7f36152d34e7a318aa3a 
-</​code>​ 
- 
-In this example the jupyter web server is at port number 8889 on the breakout. The breakout is configured such that this port can NOT be reached from outside. Therefore you have to tunnel this port via ssh to your client machine. So do the following on your client with your account name. 
- 
-<​code>​ 
-FriedrichsMacBook:​~ fritz$ ssh -p 2222 -L 8889:​localhost:​8889 beckmanf@breakout.hs-augsburg.de 
-</​code>​ 
- 
-Now you can open the jupyter notebook via a local webbrowser on your client machine. The url is the one which was given above including the token. 
- 
-=== Running the imagenet training === 
- 
-The [[http://​image-net.org/​challenges/​LSVRC/​2012/​index|imagenet-12 dataset]] is a set of 1.3 million images which are hand labeled and categorized in 1000 categories. The data is available on the breakout at /​fast/​imagenet. The training is done with the pytorch examples. Install the pytorch examples from the git repository: 
- 
-<​code>​ 
-cd 
-cd pytorch 
-git clone https://​github.com/​pytorch/​examples.git 
-cd examples 
-cd imagenet 
-</​code>​ 
- 
-Now you can run the pytorch imagenet training with 
- 
-<​code>​ 
-python main.py --gpu=2 -a resnet18 /​fast/​imagenet 
-</​code>​ 
- 
 -The training takes about 5 days on the breakout. Refer to "Running long jobs" to see how you can run such long jobs on the breakout. 
- 
- 
 -==== Civil engineers - Photoscan ==== 
- 
-The photoscan software is installed under /​opt/​photoscan-pro. To run the software via the graphical user interface start the gui session via vncserver as described above. Then open a terminal and start photoscan via: 
- 
-=== Start the Software === 
- 
-<​code>​ 
-vglrun /​opt/​photoscan-pro/​photoscan.sh 
-</​code>​ 
- 
-=== License Activation === 
-The software is currently installed with root as owner. Therefore only root can update the software and the license. To update the license, do: 
- 
-<​code>​ 
-sudo /​opt/​photoscan-pro/​photoscan.sh --activate EGKKS-KRNPU-LRMLE-RJDTS-GE4SK 
-</​code>​ 
  
  • breakout.1545811050.txt.gz
  • Last modified: 2018/12/26 08:57
  • by beckmanf