Sunday 14 March 2021

Running NVIDIA DIGITS Docker container on Ubuntu

Installing NVIDIA DIGITS directly on your computer means that you'll:
  • spend a considerable amount of time installing all the dependencies and building DIGITS itself
  • pollute your machine with another application and its dependencies
To avoid this, we can run the NVIDIA DIGITS Docker container instead. Let's first check whether Docker is installed and which version it is:

$ docker --version
Docker version 20.10.3, build 48d30b5
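A script that depends on Docker can make the same check before doing anything else. This is only a minimal sketch; check_cmd is a hypothetical helper name, not something Docker provides:

```shell
# check_cmd NAME: print "found" if NAME resolves to a command on PATH, else "missing".
# (check_cmd is a hypothetical helper used for illustration.)
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found"
  else
    echo "missing"
  fi
}

echo "docker: $(check_cmd docker)"
```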

For reference, I ran the commands in this article on Ubuntu 20.04:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.2 LTS
Release: 20.04
Codename: focal

Ideally, we'd be running NVIDIA DIGITS on a machine with one or more GPUs. This speeds up training and inference, but DIGITS can also work on a CPU-only machine.

I have a GeForce GT 640 graphics card:

$ nvidia-smi -L
GPU 0: GeForce GT 640 (UUID: GPU-f2583df9-404d-2564-d332-e7878a94d087)
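To use this list in a script, the output of nvidia-smi -L can be reduced to the device names. The following is a minimal sketch that assumes the line format shown above; gpu_names is a hypothetical helper name:

```shell
# gpu_names: read `nvidia-smi -L` style lines on stdin and print only the device names.
# (gpu_names is a hypothetical helper; it assumes lines of the form
#  "GPU <index>: <name> (UUID: <uuid>)".)
gpu_names() {
  sed -n 's/^GPU [0-9][0-9]*: \(.*\) (UUID: .*)$/\1/p'
}

# Illustration with the line from this article:
printf '%s\n' 'GPU 0: GeForce GT 640 (UUID: GPU-f2583df9-404d-2564-d332-e7878a94d087)' | gpu_names
```

On a machine with the driver installed this would be used as nvidia-smi -L | gpu_names. Empty output means no GPU was listed.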

$ lspci
...
VGA compatible controller: NVIDIA Corporation GK107 [GeForce GT 640 OEM] (rev a1)
...

GK107 is the code name of the GPU in the GeForce GT 640 (GDDR5) (source: GeForce 600 series - Wikipedia) which, according to CUDA GPUs | NVIDIA Developer, has compute capability 3.5. That is supported by the NVIDIA Container Toolkit, which requires a compute capability greater than 2.1 according to Installation Guide — NVIDIA Cloud Native Technologies documentation.

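To test the local GPU, we can run the nvidia-smi application on the host or in a Docker container.

If we haven't installed CUDA or nvidia-smi locally, we can run nvidia-smi from the NVIDIA CUDA Docker image (the --gpus option requires the NVIDIA Container Toolkit, whose installation is described later in this article):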

$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Thu Feb 11 01:02:09 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GT 640      Off  | 00000000:01:00.0 N/A |                  N/A |
| 40%   31C    P8    N/A /  N/A |    286MiB /  1992MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


Let's now follow the instructions from DIGITS | NVIDIA NGC. We first need to download the image to our local host:

$ docker pull nvcr.io/nvidia/digits:20.12-tensorflow-py3
20.12-tensorflow-py3: Pulling from nvidia/digits
6a5697faee43: Pulling fs layer 
ba13d3bc422b: Pulling fs layer 
...
cec6045b0d0e: Pulling fs layer 
cb4aa708e833: Waiting 
235cfa23a5f4: Waiting 
24781a3c82ea: Waiting 
f7c7d47c1a97: Pull complete 
...
b57dde2f2923: Pull complete 
Digest: sha256:7542143bc2292fc48a3874786877815a5ca6a74a69366324aaf66914155cb5a7
Status: Downloaded newer image for nvcr.io/nvidia/digits:20.12-tensorflow-py3
nvcr.io/nvidia/digits:20.12-tensorflow-py3

Let's now run the container. docker run has a --gpus option which instructs Docker to add GPU devices to the container ('all' passes all GPUs):

$ docker run --gpus all -it --rm nvcr.io/nvidia/digits:20.12-tensorflow-py3
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

The error occurred because I hadn't installed the NVIDIA Container Toolkit (nvidia-docker), which enables Docker containers to access the host's GPU. Installation Guide — NVIDIA Cloud Native Technologies documentation describes how to install it:

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2
$ sudo systemctl restart docker
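The first command above builds the repository name by sourcing /etc/os-release and concatenating ID and VERSION_ID; on Ubuntu 20.04 this should yield ubuntu20.04. The sketch below shows the same mechanism on a throwaway file so it can be tried without touching the system; distribution_from is a hypothetical helper name:

```shell
# distribution_from FILE: print ID followed by VERSION_ID, read from an os-release style file.
# (distribution_from is a hypothetical helper used for illustration.)
distribution_from() {
  # Source the file in a subshell so its variables do not leak into the caller.
  ( . "$1" && echo "${ID}${VERSION_ID}" )
}

# Throwaway file holding the two relevant fields as they appear on Ubuntu 20.04.
sample=$(mktemp)
printf 'ID=ubuntu\nVERSION_ID="20.04"\n' > "$sample"
distribution_from "$sample"
rm -f "$sample"
```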


We can verify the installation:

$ nvidia-docker version
NVIDIA Docker: 2.5.0
Client: Docker Engine - Community
 Version:           20.10.3
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        48d30b5
 Built:             Fri Jan 29 14:33:21 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.3
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       46229ca
  Built:            Fri Jan 29 14:31:32 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0


To be on the safe side, I also installed the latest NVIDIA driver and checked that a container can see the GPU:

$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
[sudo] password for bojan: 
Thu Feb 11 01:02:09 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GT 640      Off  | 00000000:01:00.0 N/A |                  N/A |
| 40%   31C    P8    N/A /  N/A |    286MiB /  1992MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


This time the DIGITS container started. The DIGITS HTTP server uses port 5000 by default; in this example it is mapped to host port 8888.

$ docker run --gpus all -it --rm -p 8888:5000 nvcr.io/nvidia/digits:20.12-tensorflow-py3

============
== DIGITS ==
============

NVIDIA Release 20.12 (build 17912121)
DIGITS Version 6.1.1

Container image Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
DIGITS Copyright (c) 2014-2019, NVIDIA CORPORATION. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
ERROR: No supported GPU(s) detected to run this container

  ___ ___ ___ ___ _____ ___
 |   \_ _/ __|_ _|_   _/ __|
 | |) | | (_ || |  | | \__ \
 |___/___\___|___| |_| |___/ 6.1.1

Caffe support disabled.
Reason: A valid Caffe installation was not found on your system.
cudaRuntimeGetVersion() failed with error #999
2021-02-11 16:23:54.454747: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/opt/digits/digits/pretrained_model/views.py:32: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if str(files['weights_file'].filename) is '':
/opt/digits/digits/pretrained_model/views.py:38: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if str(files['model_def_file'].filename) is '':
/opt/digits/digits/pretrained_model/views.py:54: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if str(files['weights_file'].filename) is '':
/opt/digits/digits/pretrained_model/views.py:60: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if str(files['model_def_file'].filename) is '':
/opt/digits/digits/pretrained_model/views.py:169: SyntaxWarning: "is" with a literal. Did you mean "=="?
  elif str(flask.request.form['job_name']) is '':
/opt/digits/digits/pretrained_model/views.py:177: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if str(flask.request.files['labels_file'].filename) is not '':
2021-02-11 16:23:56 [INFO ] Loaded 0 jobs.


If we now open a browser on the host and go to http://localhost:8888, we'll see the DIGITS home page.



As DIGITS is a web-based application, we don't need to run it in interactive mode (docker run -it) but can run it in detached mode (docker run -d). Here I also name the container and use -v to mount host directories for data and jobs into it:

$ docker run \
--gpus all \
-d \
--name digits \
--rm \
-p 8888:5000 \
-v /home/bojan/dev/digits-demo/data:/data \
-v /home/bojan/dev/digits-demo/jobs:/workspace/jobs \
nvcr.io/nvidia/digits:20.12-tensorflow-py3

905f9a8c8e48bc87ae99117eed92b855d45c7d37695c0e94433bd18fab6bfaca

We can verify that the DIGITS container is indeed running:

$ docker ps 
CONTAINER ID   IMAGE                                        COMMAND                  CREATED              STATUS              PORTS                                                  NAMES
905f9a8c8e48   nvcr.io/nvidia/digits:20.12-tensorflow-py3   "/usr/local/bin/nvid…"   About a minute ago   Up About a minute   6006/tcp, 6064/tcp, 8888/tcp, 0.0.0.0:8888->5000/tcp   digits
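In the PORTS column above, only the entry containing "->" is published to the host; 6006/tcp, 6064/tcp and 8888/tcp are merely exposed inside the container. A minimal sketch for picking out the published mappings from docker ps output follows; published_ports is a hypothetical helper name, and it assumes a grep that supports -o:

```shell
# published_ports: read `docker ps` style text on stdin and print each
# host->container mapping, e.g. "0.0.0.0:8888->5000/tcp".
# (published_ports is a hypothetical helper; assumes grep supports -o.)
published_ports() {
  grep -o '[0-9.]*:[0-9][0-9]*->[0-9][0-9]*/[a-z]*'
}

# Illustration with the PORTS column from this article:
printf '%s\n' '6006/tcp, 6064/tcp, 8888/tcp, 0.0.0.0:8888->5000/tcp' | published_ports
```

With the container running, this could be fed from docker ps --filter name=digits --format '{{.Ports}}'.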


Why doesn't DIGITS recognize my GPU?



One thing didn't seem right to me, though. The upper right corner of the DIGITS home page should show text indicating how many GPUs are available. In my case, although I have one GPU, none were listed.




I first checked whether the GPU is indeed visible from inside the container:

$ docker exec -it digits bash
root@e58b860504a9:/workspace# 

root@e58b860504a9:/workspace# nvidia-smi
Fri Feb 12 23:33:17 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GT 640      Off  | 00000000:01:00.0 N/A |                  N/A |
| 40%   32C    P8    N/A /  N/A |    260MiB /  1992MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The graphics card was visible. The DIGITS installation contains a Python script, DIGITS Device Query (source code: python/9427/DIGITS/digits/device_query.py). When I tried to run it, I got an error:

root@e58b860504a9:/opt/digits/digits# python device_query.py 
cudaRuntimeGetVersion() failed with error #999
No devices found.


Error 999 is cudaErrorUnknown, which the CUDA documentation describes as follows: "This indicates that an unknown internal error has occurred."

The CUDA toolkit itself was installed fine in the container:

root@6cd6c429f20c:/workspace# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0


On the host system I checked whether loading the NVIDIA driver produced any errors (NVRM messages are logged by the nvidia kernel module):

$ sudo dmesg |grep NVRM
[sudo] password for bojan: 
[    2.283911] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  460.32.03  Sun Dec 27 19:00:34 UTC 2020
[ 8654.742795] NVRM: GPU at PCI:0000:01:00: GPU-f2583df9-404d-2564-d332-e7878a94d087
[ 8654.742800] NVRM: Xid (PCI:0000:01:00): 31, pid=577, Ch 00000002, intr 10000000. MMU Fault: ENGINE HOST4 HUBCLIENT_HOST faulted @ 0x1_01160000. Fault is of type FAULT_INFO_TYPE_UNSUPPORTED
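The number after "Xid (...):" in the last line is the Xid error code (31 here), which is the value to look up in NVIDIA's Xid documentation. A minimal sketch for extracting it from the kernel log follows, assuming the line format shown above; xid_codes is a hypothetical helper name:

```shell
# xid_codes: read kernel log lines on stdin and print the Xid number of each
# NVRM Xid report, one per line.
# (xid_codes is a hypothetical helper; it assumes lines of the form
#  "... NVRM: Xid (<pci address>): <code>, ...".)
xid_codes() {
  sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p'
}

# Illustration with a shortened copy of the line from this article:
printf '%s\n' '[ 8654.742800] NVRM: Xid (PCI:0000:01:00): 31, pid=577, Ch 00000002, intr 10000000.' | xid_codes
```

On the host this would be used as sudo dmesg | xid_codes.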


I could not deduce anything useful from this, but by reading the DIGITS release notes I finally found the reason why DIGITS won't recognize my GPU: it is too old!

Installation Guide — NVIDIA Cloud Native Technologies documentation specifies the compute capability requirements for the NVIDIA Container Toolkit, but the requirements for the DIGITS Docker image are specified separately for each image release. For digits:20.12, DIGITS Release Notes :: NVIDIA Deep Learning DIGITS Documentation states the following:

Release 20.12 supports CUDA compute capability 6.0 and higher.

My GPU has compute capability 3.5 and so it does not meet that requirement.
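The comparison can be scripted so that it is checked before pulling an image. This is a minimal sketch, assuming a sort that supports version ordering (-V, as in GNU coreutils); cc_at_least is a hypothetical helper name:

```shell
# cc_at_least HAVE NEED: print "yes" if compute capability HAVE is at least NEED, else "no".
# (cc_at_least is a hypothetical helper; it relies on `sort -V` for version ordering.)
cc_at_least() {
  # The smaller of the two versions sorts first; if that is NEED, then HAVE >= NEED.
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
  if [ "$lowest" = "$2" ]; then
    echo "yes"
  else
    echo "no"
  fi
}

# My GPU (3.5) against the digits:20.12 requirement (6.0):
cc_at_least 3.5 6.0
```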








