Differences

This shows you the differences between two versions of the page.

--- breakout [2018/12/26 08:57]
beckmanf [PyTorch]
+++ breakout [2018/12/27 09:55]
beckmanf [PyTorch] long jobs
@@ Line 134: / Line 134: @@
 </code>
-can be checked. In the example above you can see that there
+can be checked. In the example above you can see:
-  * are four GeForce GTX 1080 graphics cards
+  * There are four GeForce GTX 1080 graphics cards
   * Graphics card "3" is currently in use - the fan is running at 90% and the temperature is 76 °C
   * The process with process ID 14538 "python" is running on card 3. Its memory is almost full at 7323 MiB.
-==== Torch ====
+==== Running long jobs ====
-All Debian packages needed to install [[http://torch.ch|Torch]] are already installed on the breakout. Torch itself is not installed via the Debian package system but directly from git into the home directory. The example checks out a version that is known to have worked. The install-deps.sh step is skipped because it installs packages with sudo. A normal user lacks the sudo rights to install these packages, and they are already installed on the breakout anyway.
+=== tmux - Keep a session running even when you logout ===
-<code>
+With tmux you can keep a session running even when you log out. You can log in again later and the session is still there. Create a new session:
-cd
-git clone https://github.com/torch/distro.git ~/torch --recursive
-cd torch
-git checkout efb9226e924d69513eea28f5f701cb5f5ca
-TORCH_LUA_VERSION=LUA52 ./install.sh
-source "$HOME/torch/install/bin/torch-activate"
-</code>
-Now add to .profile
 <code>
-# NVidia cuDNN library
+tmux new-session -s fredo
-if [ -f "/home/fritz/cuda/cudnn/cuda/lib64/libcudnn.so.6" ]; then
-  export CUDNN_PATH="/home/fritz/cuda/cudnn/cuda/lib64/libcudnn.so.6"
-fi
-# Torch environment settings
-if [ -f "$HOME/torch/install/bin/torch-activate" ]; then
-  source "$HOME/torch/install/bin/torch-activate"
-fi
 </code>
-As an example you can try [[http://torch.ch/blog/2015/07/30/cifar.html]]. It classifies 50000 images from the [[https://www.cs.toronto.edu/~kriz/cifar.html|CIFAR-10]] benchmark.
+Now you can start a program. To leave the tmux session (and the program) running, press CTRL-b and then d. This detaches you from the tmux session. You can then log out of your ssh session while everything keeps running on the breakout. When you later log in to the breakout via ssh again, you can reattach to tmux with
 <code>
-cd
+tmux attach-session -t fredo
-git clone https://github.com/szagoruyko/cifar.torch.git
-cd cifar.torch
-OMP_NUM_THREADS=2 th -i provider.lua
-# Opens torch shell - inside th:
-provider = Provider()
-provider:normalize()
-torch.save('provider.t7',provider)
-exit
-# Now back on shell
-CUDA_VISIBLE_DEVICES=0 th train.lua --model=vgg_bn_drop -s logs/vgg
 </code>
-The previous training uses the cuda compiled torch neural network models. NVidia provides specially crafted cuDNN models which are faster. To use these models:
+You should see the output from your running program.
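The tmux steps above can be combined into a small helper. This is a minimal sketch and not part of the original instructions: the function name start_in_tmux is hypothetical, and the session name and command in the usage note are placeholders. It relies only on the tmux subcommands has-session and new-session, where -d creates the session without attaching to it.

```shell
# Sketch: start a command in a detached tmux session unless a session
# with that name already exists. start_in_tmux is a hypothetical helper.
start_in_tmux() {
    session="$1"   # tmux session name, e.g. fredo
    command="$2"   # command line to run inside the session, as one string
    if tmux has-session -t "$session" 2>/dev/null; then
        echo "session $session already exists"
    else
        # -d: create the session in the background without attaching
        tmux new-session -d -s "$session" "$command" || return 1
        echo "session $session started"
    fi
    echo "attach with: tmux attach-session -t $session"
}
```

A call such as start_in_tmux fredo "python main.py" would start the program in the background; you can then log out and reattach later with the printed command.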
-<code>
+=== kerberos - keep your file system alive ===
-CUDA_VISIBLE_DEVICES=0 th train.lua --model=vgg_bn_drop --backend=cudnn -s logs/cudnn
-</code>
-The network can also be trained without cuda/gpu support:
+When you log in to the breakout with your RZ account, your home directory is mounted on the breakout from the RZ file server via NFS. After you log out, your home directory is unmounted after 5 minutes unless you still have a job running. If a job is running, e.g. in tmux or in the background, your home directory remains mounted.
-<code>
+If you leave a job running for more than about 10 hours, you get errors when you try to access files in your home directory. The reason is that mounting requires authentication, which is handled by the kerberos service. When you log in to the breakout with your password, you automatically receive a kerberos ticket derived from your login credentials. The automounter needs this ticket for your home directory: without a valid kerberos ticket the NFS server does not allow access to your files. For example, the pytorch example [[#Running the imagenet training]] takes about 5 days. After approximately 10 hours of runtime I receive the following bus error message:
-OMP_NUM_THREADS=16 th train.lua --model=vgg_bn_drop --type=float -s logs/cpu
-</code>
-==== Docker ====
-Docker lets you run additional software packages without changing the base installation. Prerequisite:
-  * Your account must be a member of the group "docker"
-Check whether you are a member of the docker group with
 <code>
-groups
+Epoch: [12][4980/5005]  Time 0.523 (0.524)      Data 0.000 (0.034)      Loss 2.5527 (2.5143)    Acc@1 44.922 (44.781)   Acc@5 69.922 (69.733)
+Epoch: [12][4990/5005]  Time 0.525 (0.524)      Data 0.000 (0.034)      Loss 2.7477 (2.5144)    Acc@1 44.141 (44.778)   Acc@5 66.016 (69.732)
+Epoch: [12][5000/5005]  Time 0.520 (0.524)      Data 0.000 (0.034)      Loss 2.3334 (2.5144)    Acc@1 46.094 (44.776)   Acc@5 70.312 (69.730)
+Test: [0/196]   Time 3.587 (3.587)      Loss 1.6937 (1.6937)    Acc@1 58.203 (58.203)   Acc@5 86.328 (86.328)
+Test: [10/196]  Time 0.159 (0.814)      Loss 2.3972 (2.0702)    Acc@1 39.062 (51.598)   Acc@5 75.391 (77.131)
+...
+Test: [170/196] Time 2.123 (0.635)      Loss 1.9238 (2.3964)    Acc@1 46.094 (45.463)   Acc@5 81.641 (72.149)
+Test: [180/196] Time 0.159 (0.630)      Loss 2.1114 (2.4070)    Acc@1 44.531 (45.254)   Acc@5 78.125 (71.996)
+Test: [190/196] Time 1.742 (0.633)      Loss 1.7933 (2.3935)    Acc@1 53.516 (45.492)   Acc@5 87.891 (72.215)
+ * Acc@1 45.864 Acc@5 72.442
+Traceback (most recent call last):
+  File "main.py", line 398, in <module>
+  File "main.py", line 113, in main
+...
+  File "/rz2home/beckmanf/miniconda3/lib/python3.7/site-packages/torch/serialization.py", line 141, in _with_file_like
+PermissionError: [Errno 13] Permission denied: 'checkpoint.pth.tar'
+Bus-Zugriffsfehler
+beckmanf@breakout:~/pytorch/examples/imagenet$
 </code>
-If you are not a member of the docker group, the following actions will not work. Please note that actions under Docker are security-relevant. Mounting directories with the -v option also makes it possible to modify files on the host that are owned by root.
+The reason for this bus error is that the pytorch program tries to write the file "checkpoint.pth.tar" to the home directory, but the home directory can no longer be accessed because the kerberos ticket has expired. ("Bus-Zugriffsfehler" is the German-locale message for a bus error.)
-=== Simple test ===
+You can check the status of your current kerberos ticket with "klist".
-see: [[https://docs.docker.com/engine/getstarted/step_one/#step-3-verify-your-installation]]
 <code>
-docker run hello-world
+beckmanf@breakout:~$ klist
-</code>
+Ticket cache: FILE:/tmp/krb5cc_12487_ssddef
+Default principal: beckmanf@RZ.HS-AUGSBURG.DE
-=== NVidia Digits ===
+Valid starting       Expires              Service principal
+.12.2018 08:28:43  27.12.2018 18:28:43  krbtgt/RZ.HS-AUGSBURG.DE@RZ.HS-AUGSBURG.DE
-see: [[https://github.com/NVIDIA/nvidia-docker/wiki/DIGITS]]
+	renew until 28.12.2018 08:28:37
-<code>
-nvidia-docker run --name digits -d -P nvidia/digits
 </code>
-  * Option -d will run the docker image as daemon.
+The kerberos ticket lifetime is 10h and the renew time is 24h. So after 18:28:43 you can no longer access your home directory. With "kinit" you can request a new ticket with a longer lifetime and a longer renew time.
-  * Option -P will assign the used ports inside docker to random ports on the host.
-To check which ports are assigned and which containers are running:
 <code>
-docker ps
+beckmanf@breakout:~$ kinit -l 2d -r 7d
+Password for beckmanf@RZ.HS-AUGSBURG.DE:
 </code>
-In my example it looks like this:
+In the example above you request a ticket lifetime of 2 days and a renew time of 7 days. You can check the result with klist again.
 <code>
-fritz@breakout:~/docker$ docker ps
+beckmanf@breakout:~$ klist
-CONTAINER ID        IMAGE               COMMAND              CREATED             STATUS              PORTS                     NAMES
+Ticket cache: FILE:/tmp/krb5cc_12487_ssddef
-f9942fca476a        nvidia/digits       "python -m digits"   32 minutes ago      Up 3 seconds        0.0.0.0:32772->5000/tcp   digits
+Default principal: beckmanf@RZ.HS-AUGSBURG.DE
-fritz@breakout:~/docker$
-</code>
-The section "PORTS" shows that port 5000 from the docker container is mapped to port 32772 on the host. Now you can run a web browser with "http://breakout.hs-augsburg.de:32772" and see the NVidia Digits web interface.
+Valid starting       Expires              Service principal
+.12.2018 08:30:09  27.12.2018 18:30:09  krbtgt/RZ.HS-AUGSBURG.DE@RZ.HS-AUGSBURG.DE
-To stop NVidia Digits run
+	renew until 03.01.2019 08:30:05
-<code>
-docker stop digits
-docker rm digits
 </code>
-==== Tensorflow ====
+The kerberos ticket lifetime is still only 10h but the renew time is now seven days.
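Before starting a long job it can help to test in a script whether a valid ticket exists. The following is a minimal sketch under the assumption that klist comes from MIT Kerberos, where the -s option suppresses output and reports the result through the exit status. The function name check_ticket is hypothetical.

```shell
# Sketch: report whether the current ticket cache holds a valid ticket.
# check_ticket is a hypothetical helper name.
check_ticket() {
    # klist -s prints nothing and exits with status 0 only if the cache
    # can be read and the ticket has not expired
    if klist -s; then
        echo "kerberos ticket valid"
    else
        echo "kerberos ticket missing or expired, run kinit" >&2
        return 1
    fi
}
```

The exit status makes it usable as a guard, for example check_ticket && python main.py.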
-=== With Python 2 ===
+== Renew a kerberos ticket ==
-Tensorflow version 1.4 supports Cuda 8.0 while all following versions require Cuda 9. The supported tensorflow version on this machine is 1.4. The recommended way to install tensorflow is "virtualenv".
-[[https://www.tensorflow.org/versions/r1.4/install/]]
+Getting a new kerberos ticket requires your password. However, you can renew an existing ticket and extend its lifetime without a password until the maximum renew time expires. You must hold a valid, non-expired ticket when you start the renewal. In the example above you would have to renew before 18:30:09. Renew with "kinit -R".
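The manual renewal can be repeated in a loop for as long as a job is alive. This is only a sketch to illustrate the idea: the function name renew_while_running and the default interval of 3600 seconds are assumptions, and the interval must stay below the ticket lifetime. Unlike krenew, this loop does not copy the ticket cache, so it does not protect against the login cache being deleted at logout.

```shell
# Sketch: renew the kerberos ticket periodically while a process is alive.
# renew_while_running is a hypothetical helper name.
renew_while_running() {
    pid="$1"               # process ID of the long-running job
    interval="${2:-3600}"  # seconds between renewals (assumed default)
    # kill -0 sends no signal; it only tests whether the process exists
    while kill -0 "$pid" 2>/dev/null; do
        if ! kinit -R; then
            echo "ticket renewal failed" >&2
            return 1
        fi
        sleep "$interval"
    done
}
```

The function returns 0 once the process has ended and 1 as soon as a renewal fails, for example because the maximum renew time has been reached.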
-Change your .profile and add the following
+== Start a job with automatic kerberos ticket renew ==
-<code>
+You can automate the renewal. When you start a job with "krenew", your existing kerberos ticket is copied to a new ticket cache location and renewed automatically until the renew time expires or the job finishes. The ticket cache is copied because the cache that you received at login (here: /tmp/krb5cc_12487_ssddef) is deleted at logout. For the pytorch imagenet training example this looks like this:
-# nvidia cuDNN library
-LD_LIBRARY_PATH="/usr/local/cuda/lib64:/home/fritz/cuda/cudnn/cuda/lib64:$LD_LIBRARY_PATH"
-</code>
-to make the cuda and cudnn library accessible. Logout and login. Tensorflow 1.4 requires cuda 8.0 and cudnn 6.0. This machine uses python 2.7.
-Install tensorflow:
 <code>
-virtualenv --system-site-packages ~/tensorflow
+krenew python -- main.py --gpu=2 -a resnet18 /fast/imagenet
-source ~/tensorflow/bin/activate
-pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0-cp27-none-linux_x86_64.whl
 </code>
-Then [[https://www.tensorflow.org/versions/r1.4/install/install_linux#ValidateYourInstallation|validate]] that the installation worked.
+If you do this inside a tmux session, you can detach and log out. The job can run for up to seven days. When you log in later, you can check the status of the job's kerberos ticket with klist again. You have to provide the filename of the job's ticket cache.
-=== With Python 3 ===
-Alternatively, you can also use Tensorflow with Python 3 on the server. Similar to the python2 version described above, only TensorFlow 1.4 is supported, but cuDNN 7.0 is used. Just add the following code to your ~/.profile
 <code>
-if [ -d "/fast/usr/bin" ] ; then
+klist /tmp/krb5cc_12487_ftXjk0
-    PATH="/fast/usr/bin:$PATH"
-fi
-if [ -d "/fast/usr/local/cuda-8.0/lib64" ] ; then
-    export LD_LIBRARY_PATH="/fast/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH"
-fi
 </code>
-Once you have reconnected to the server, you are ready to use python3 with TensorFlow.
+In my example the new cache name from krenew was /tmp/krb5cc_12487_ftXjk0.
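If you do not remember the cache name, you can look through all cache files that start with your user ID. The following is a minimal sketch: the function name list_ticket_caches is hypothetical, and the /tmp/krb5cc_<uid> naming pattern is assumed from the examples on this page. It passes the cache file name to klist in the same way as the command above.

```shell
# Sketch: print the tickets in every cache file that belongs to the
# current user. list_ticket_caches is a hypothetical helper name.
list_ticket_caches() {
    dir="${1:-/tmp}"   # directory that holds the krb5cc_<uid>... files
    uid="$(id -u)"
    for cache in "$dir"/krb5cc_"$uid"*; do
        # skip the unexpanded pattern when no cache file matches
        [ -f "$cache" ] || continue
        echo "== $cache"
        klist "$cache"
    done
}
```

Calling list_ticket_caches without an argument searches /tmp.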
+== Login via Public Key Authentication ==
+When you log in via Public Key Authentication, you do not receive a new kerberos ticket. If you do not have a valid kerberos ticket, "$HOME/.ssh/authorized_keys" cannot be accessed, so you fall back to the default password login and receive a new kerberos ticket. If you logged in via Public Key, "klist" does not show any kerberos ticket, because the ticket that is active belongs to some other login session. However, you can still run "kinit" and receive a new kerberos ticket. It is stored in the default kerberos ticket cache location "/tmp/krb5cc_<uid>".
 ==== PyTorch ====
@@ Line 384: / Line 334: @@
 </code>
-The training takes about 5 days on the breakout. Refer to "Running long jobs" to see how you can run such long jobs on the breakout.
+The training takes about 5 days on the breakout. Refer to [[#Running long jobs]] to see how you can run such long jobs on the breakout.
 ==== Bauingenieure - Photoscan ====
@@ Line 403: / Line 352: @@
 sudo /opt/photoscan-pro/photoscan.sh --activate EGKKS-KRNPU-LRMLE-RJDTS-GE4SK
 </code>
+==== Torch ====
+All Debian packages needed to install [[http://torch.ch|Torch]] are already installed on the breakout. Torch itself is not installed via the Debian package system but directly from git into the home directory. The example checks out a version that is known to have worked. The install-deps.sh step is skipped because it installs packages with sudo. A normal user lacks the sudo rights to install these packages, and they are already installed on the breakout anyway.
+<code>
+cd
+git clone https://github.com/torch/distro.git ~/torch --recursive
+cd torch
+git checkout efb9226e924d69513eea28f5f701cb5f5ca
+TORCH_LUA_VERSION=LUA52 ./install.sh
+source "$HOME/torch/install/bin/torch-activate"
+</code>
+Now add to .profile
+<code>
+# NVidia cuDNN library
+if [ -f "/home/fritz/cuda/cudnn/cuda/lib64/libcudnn.so.6" ]; then
+  export CUDNN_PATH="/home/fritz/cuda/cudnn/cuda/lib64/libcudnn.so.6"
+fi
+# Torch environment settings
+if [ -f "$HOME/torch/install/bin/torch-activate" ]; then
+  source "$HOME/torch/install/bin/torch-activate"
+fi
+</code>
+As an example you can try [[http://torch.ch/blog/2015/07/30/cifar.html]]. It classifies 50000 images from the [[https://www.cs.toronto.edu/~kriz/cifar.html|CIFAR-10]] benchmark.
+<code>
+cd
+git clone https://github.com/szagoruyko/cifar.torch.git
+cd cifar.torch
+OMP_NUM_THREADS=2 th -i provider.lua
+# Opens torch shell - inside th:
+provider = Provider()
+provider:normalize()
+torch.save('provider.t7',provider)
+exit
+# Now back on shell
+CUDA_VISIBLE_DEVICES=0 th train.lua --model=vgg_bn_drop -s logs/vgg
+</code>
+The previous training uses the cuda compiled torch neural network models. NVidia provides specially crafted cuDNN models which are faster. To use these models:
+<code>
+CUDA_VISIBLE_DEVICES=0 th train.lua --model=vgg_bn_drop --backend=cudnn -s logs/cudnn
+</code>
+The network can also be trained without cuda/gpu support:
+<code>
+OMP_NUM_THREADS=16 th train.lua --model=vgg_bn_drop --type=float -s logs/cpu
+</code>
+==== Docker ====
+Docker lets you run additional software packages without changing the base installation. Prerequisite:
+  * Your account must be a member of the group "docker"
+Check whether you are a member of the docker group with
+<code>
+groups
+</code>
+If you are not a member of the docker group, the following actions will not work. Please note that actions under Docker are security-relevant. Mounting directories with the -v option also makes it possible to modify files on the host that are owned by root.
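The group membership test can also be scripted. This is a minimal sketch; the function name in_group is hypothetical. It uses id -nG, which prints the names of all groups of the current user.

```shell
# Sketch: test whether the current user belongs to a given group.
# in_group is a hypothetical helper name.
in_group() {
    # id -nG prints all group names of the user separated by spaces;
    # grep -x -F matches one whole line literally
    id -nG | tr ' ' '\n' | grep -qxF -- "$1"
}

if in_group docker; then
    echo "member of docker"
else
    echo "not a member of docker"
fi
```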
+=== Simple test ===
+see: [[https://docs.docker.com/engine/getstarted/step_one/#step-3-verify-your-installation]]
+<code>
+docker run hello-world
+</code>
+=== NVidia Digits ===
+see: [[https://github.com/NVIDIA/nvidia-docker/wiki/DIGITS]]
+<code>
+nvidia-docker run --name digits -d -P nvidia/digits
+</code>
+  * Option -d will run the docker image as daemon.
+  * Option -P will assign the used ports inside docker to random ports on the host.
+To check which ports are assigned and which containers are running:
+<code>
+docker ps
+</code>
+In my example it looks like this:
+<code>
+fritz@breakout:~/docker$ docker ps
+CONTAINER ID        IMAGE               COMMAND              CREATED             STATUS              PORTS                     NAMES
+f9942fca476a        nvidia/digits       "python -m digits"   32 minutes ago      Up 3 seconds        0.0.0.0:32772->5000/tcp   digits
+fritz@breakout:~/docker$
+</code>
+The section "PORTS" shows that port 5000 from the docker container is mapped to port 32772 on the host. Now you can run a web browser with "http://breakout.hs-augsburg.de:32772" and see the NVidia Digits web interface.
+To stop NVidia Digits run
+<code>
+docker stop digits
+docker rm digits
+</code>
+==== Tensorflow ====
+=== With Python 2 ===
+Tensorflow version 1.4 supports Cuda 8.0 while all following versions require Cuda 9. The supported tensorflow version on this machine is 1.4. The recommended way to install tensorflow is "virtualenv".
+[[https://www.tensorflow.org/versions/r1.4/install/]]
+Change your .profile and add the following
+<code>
+# nvidia cuDNN library
+LD_LIBRARY_PATH="/usr/local/cuda/lib64:/home/fritz/cuda/cudnn/cuda/lib64:$LD_LIBRARY_PATH"
+</code>
+to make the cuda and cudnn library accessible. Logout and login. Tensorflow 1.4 requires cuda 8.0 and cudnn 6.0. This machine uses python 2.7.
+Install tensorflow:
+<code>
+virtualenv --system-site-packages ~/tensorflow
+source ~/tensorflow/bin/activate
+pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0-cp27-none-linux_x86_64.whl
+</code>
+Then [[https://www.tensorflow.org/versions/r1.4/install/install_linux#ValidateYourInstallation|validate]] that the installation worked.
+=== With Python 3 ===
+Alternatively, you can also use Tensorflow with Python 3 on the server. Similar to the python2 version described above, only TensorFlow 1.4 is supported, but cuDNN 7.0 is used. Just add the following code to your ~/.profile
+<code>
+if [ -d "/fast/usr/bin" ] ; then
+    PATH="/fast/usr/bin:$PATH"
+fi
+if [ -d "/fast/usr/local/cuda-8.0/lib64" ] ; then
+    export LD_LIBRARY_PATH="/fast/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH"
+fi
+</code>
+Once you have reconnected to the server, you are ready to use python3 with TensorFlow.