Installing NVIDIA DIGITS directly on your computer means that you'll:
- spend a considerable amount of time installing all the dependencies and building DIGITS itself
- pollute your machine with another application and its dependencies
To avoid this, we can run the NVIDIA DIGITS Docker container. Let's first check whether Docker is installed, and which version:
$ docker --version
Docker version 20.10.3, build 48d30b5
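The same check can be scripted so that later steps only run when Docker is present. This is a minimal sketch; parse_docker_version is a hypothetical helper name, and it assumes the "Docker version X.Y.Z, build ..." output format shown above.

```shell
# Extract the version number from a line such as
# "Docker version 20.10.3, build 48d30b5" (format assumed from the output above).
parse_docker_version() {
    printf '%s\n' "$1" | sed -n 's/^Docker version \([0-9][0-9.]*\).*/\1/p'
}

# Only query Docker when the binary is actually on PATH.
if command -v docker >/dev/null 2>&1; then
    echo "Docker $(parse_docker_version "$(docker --version)") found"
else
    echo "Docker is not installed" >&2
fi
```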
For reference, I ran the commands in this article on Ubuntu 20.04:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.2 LTS
Release: 20.04
Codename: focal
Ideally, we'd run NVIDIA DIGITS on a machine with one or more GPUs. This speeds up training and inference, but DIGITS can also work on a CPU-only machine.
I have a GeForce GT 640 graphics card:
$ nvidia-smi -L
GPU 0: GeForce GT 640 (UUID: GPU-f2583df9-404d-2564-d332-e7878a94d087)
$ lspci
...
VGA compatible controller: NVIDIA Corporation GK107 [GeForce GT 640 OEM] (rev a1)
...
GK107 is the code name for the GeForce GT 640 (GDDR5) (source: GeForce 600 series - Wikipedia) which, according to CUDA GPUs | NVIDIA Developer, has compute capability 3.5. That is enough for the NVIDIA Container Toolkit, which requires a compute capability greater than 2.1 according to the Installation Guide in the NVIDIA Cloud Native Technologies documentation.
To test the local GPU, we can run nvidia-smi either on the host or in a Docker container.
If CUDA or nvidia-smi isn't installed locally, we can run nvidia-smi from the NVIDIA CUDA Docker image:
$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Thu Feb 11 01:02:09 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GT 640 Off | 00000000:01:00.0 N/A | N/A |
| 40% 31C P8 N/A / N/A | 286MiB / 1992MiB | N/A Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Let's now follow the instructions from DIGITS | NVIDIA NGC. We first need to download the image to our local host:
$ docker pull nvcr.io/nvidia/digits:20.12-tensorflow-py3
20.12-tensorflow-py3: Pulling from nvidia/digits
6a5697faee43: Pulling fs layer
ba13d3bc422b: Pulling fs layer
...
cec6045b0d0e: Pulling fs layer
cb4aa708e833: Waiting
235cfa23a5f4: Waiting
24781a3c82ea: Waiting
f7c7d47c1a97: Pull complete
...
b57dde2f2923: Pull complete
Digest: sha256:7542143bc2292fc48a3874786877815a5ca6a74a69366324aaf66914155cb5a7
Status: Downloaded newer image for nvcr.io/nvidia/digits:20.12-tensorflow-py3
nvcr.io/nvidia/digits:20.12-tensorflow-py3
Let's now run the container. docker run has a --gpus option which instructs Docker to add GPU devices to the container ('all' passes all GPUs).
$ docker run --gpus all -it --rm nvcr.io/nvidia/digits:20.12-tensorflow-py3
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
The error appeared because I hadn't installed the NVIDIA Container Toolkit (nvidia-docker), which enables Docker containers to access the host's GPU. The Installation Guide in the NVIDIA Cloud Native Technologies documentation describes how to install it:
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2
$ sudo systemctl restart docker
$ nvidia-docker version
NVIDIA Docker: 2.5.0
Client: Docker Engine - Community
Version: 20.10.3
API version: 1.41
Go version: go1.13.15
Git commit: 48d30b5
Built: Fri Jan 29 14:33:21 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.3
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: 46229ca
Built: Fri Jan 29 14:31:32 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.3
GitCommit: 269548fa27e0089a8b8278fc4fc781d7f65a939b
runc:
Version: 1.0.0-rc92
GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
docker-init:
Version: 0.19.0
GitCommit: de40ad0
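The nvidia-docker2 package registers an "nvidia" runtime in Docker's daemon configuration, so another way to confirm the installation is to look for that entry in docker info. This is a sketch only; has_nvidia_runtime is a hypothetical helper, and the check assumes the runtime map is printed as JSON with the runtime names as keys.

```shell
# Succeeds when the JSON runtime map passed as $1 has an "nvidia" entry.
has_nvidia_runtime() {
    printf '%s' "$1" | grep -q '"nvidia":'
}

# Query the daemon only when Docker is available on this machine.
if command -v docker >/dev/null 2>&1; then
    runtimes=$(docker info --format '{{json .Runtimes}}' 2>/dev/null)
    if has_nvidia_runtime "$runtimes"; then
        echo "nvidia runtime is registered"
    else
        echo "nvidia runtime is not registered" >&2
    fi
fi
```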
To be on the safe side, I also installed the latest NVIDIA driver.
$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
[sudo] password for bojan:
Thu Feb 11 01:02:09 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GT 640 Off | 00000000:01:00.0 N/A | N/A |
| 40% 31C P8 N/A / N/A | 286MiB / 1992MiB | N/A Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
This time the DIGITS container started. The DIGITS 6.0 HTTP server uses port 5000 by default; in this example it is mapped to host port 8888.
$ docker run --gpus all -it --rm -p 8888:5000 nvcr.io/nvidia/digits:20.12-tensorflow-py3
============
== DIGITS ==
============
NVIDIA Release 20.12 (build 17912121)
DIGITS Version 6.1.1
Container image Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
DIGITS Copyright (c) 2014-2019, NVIDIA CORPORATION. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
ERROR: No supported GPU(s) detected to run this container
___ ___ ___ ___ _____ ___
| \_ _/ __|_ _|_ _/ __|
| |) | | (_ || | | | \__ \
|___/___\___|___| |_| |___/ 6.1.1
Caffe support disabled.
Reason: A valid Caffe installation was not found on your system.
cudaRuntimeGetVersion() failed with error #999
2021-02-11 16:23:54.454747: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/opt/digits/digits/pretrained_model/views.py:32: SyntaxWarning: "is" with a literal. Did you mean "=="?
if str(files['weights_file'].filename) is '':
/opt/digits/digits/pretrained_model/views.py:38: SyntaxWarning: "is" with a literal. Did you mean "=="?
if str(files['model_def_file'].filename) is '':
/opt/digits/digits/pretrained_model/views.py:54: SyntaxWarning: "is" with a literal. Did you mean "=="?
if str(files['weights_file'].filename) is '':
/opt/digits/digits/pretrained_model/views.py:60: SyntaxWarning: "is" with a literal. Did you mean "=="?
if str(files['model_def_file'].filename) is '':
/opt/digits/digits/pretrained_model/views.py:169: SyntaxWarning: "is" with a literal. Did you mean "=="?
elif str(flask.request.form['job_name']) is '':
/opt/digits/digits/pretrained_model/views.py:177: SyntaxWarning: "is not" with a literal. Did you mean "!="?
if str(flask.request.files['labels_file'].filename) is not '':
2021-02-11 16:23:56 [INFO ] Loaded 0 jobs.
If we now open a browser on the host and go to http://localhost:8888, we'll see the DIGITS home page.
As DIGITS is a web-based application, we don't need to run it in interactive mode (docker run -it); we can run it in detached mode (docker run -d) instead. The command below also names the container and mounts host directories for data and jobs:
$ docker run \
--gpus all \
-d \
--name digits \
--rm \
-p 8888:5000 \
-v /home/bojan/dev/digits-demo/data:/data \
-v /home/bojan/dev/digits-demo/jobs:/workspace/jobs \
nvcr.io/nvidia/digits:20.12-tensorflow-py3
905f9a8c8e48bc87ae99117eed92b855d45c7d37695c0e94433bd18fab6bfaca
We can verify that the DIGITS container is indeed running:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
905f9a8c8e48 nvcr.io/nvidia/digits:20.12-tensorflow-py3 "/usr/local/bin/nvid…" About a minute ago Up About a minute 6006/tcp, 6064/tcp, 8888/tcp, 0.0.0.0:8888->5000/tcp digits
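docker ps only tells us that the container is up; the web server inside may need a moment before it answers requests. A small retry helper can wait for it. This is a sketch under assumptions: retry is a hypothetical helper, and the attempt count and delay in the usage comment are arbitrary.

```shell
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds or ATTEMPTS run out.
retry() {
    local attempts="$1" delay="$2" i=0
    shift 2
    while [ "$i" -lt "$attempts" ]; do
        "$@" && return 0      # stop as soon as the command succeeds
        i=$((i + 1))
        sleep "$delay"
    done
    return 1                  # every attempt failed
}

# Usage (assumes curl is installed and DIGITS is mapped to host port 8888):
#   retry 30 1 curl -fsS -o /dev/null http://localhost:8888 && echo "DIGITS is up"
```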
Why doesn't DIGITS recognize my GPU?
One thing didn't seem right to me though. The upper right corner of the DIGITS home page should show how many GPUs are available. In my case, although I have one GPU, no GPUs were listed.
I first checked whether the GPU was indeed visible from the container:
$ docker exec -it digits bash
root@e58b860504a9:/workspace#
root@e58b860504a9:/workspace# nvidia-smi
Fri Feb 12 23:33:17 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GT 640 Off | 00000000:01:00.0 N/A | N/A |
| 40% 32C P8 N/A / N/A | 260MiB / 1992MiB | N/A Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The graphics card was visible. The DIGITS installation contains a Python script, DIGITS Device Query (source code: python/9427/DIGITS/digits/device_query.py). When I tried to run it, I got an error:
root@e58b860504a9:/opt/digits/digits# python device_query.py
cudaRuntimeGetVersion() failed with error #999
No devices found.
cudaErrorUnknown = 999: This indicates that an unknown internal error has occurred.
CUDA was installed fine:
root@6cd6c429f20c:/workspace# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
On the host system I checked if loading the NVIDIA driver gave any errors (NVRM errors are internal to the nvidia kernel module):
$ sudo dmesg |grep NVRM
[sudo] password for bojan:
[ 2.283911] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 460.32.03 Sun Dec 27 19:00:34 UTC 2020
[ 8654.742795] NVRM: GPU at PCI:0000:01:00: GPU-f2583df9-404d-2564-d332-e7878a94d087
[ 8654.742800] NVRM: Xid (PCI:0000:01:00): 31, pid=577, Ch 00000002, intr 10000000. MMU Fault: ENGINE HOST4 HUBCLIENT_HOST faulted @ 0x1_01160000. Fault is of type FAULT_INFO_TYPE_UNSUPPORTED
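In a longer kernel log it can help to pull out just the Xid codes, which can then be looked up in NVIDIA's Xid documentation. The sketch below assumes the "NVRM: Xid (PCI:...): CODE," line format seen above; extract_xid is a hypothetical helper name.

```shell
# Print the Xid code from every NVRM Xid line read on stdin.
extract_xid() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p'
}

# Demo on the log line from this article; prints 31.
printf '%s\n' '[ 8654.742800] NVRM: Xid (PCI:0000:01:00): 31, pid=577, Ch 00000002, intr 10000000.' | extract_xid

# On a real host, count how often each code occurs:
#   sudo dmesg | extract_xid | sort -n | uniq -c
```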
I could not deduce anything useful from this, but by reading the DIGITS release notes I finally found the reason why DIGITS won't recognize my GPU: it is too old!
The Installation Guide in the NVIDIA Cloud Native Technologies documentation specifies the compute capability requirement for the NVIDIA Container Toolkit, but the requirement for the DIGITS Docker image is specified separately for each image release. For digits:20.12, the DIGITS Release Notes (NVIDIA Deep Learning DIGITS Documentation) state the following:
Release 20.12 supports CUDA compute capability 6.0 and higher.
My GPU has compute capability 3.5 and so it does not meet that requirement.
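The comparison is worth doing before pulling an image. Here is a minimal sketch of it; cc_at_least is a hypothetical helper that assumes both values are written as "major.minor", and the numbers are the ones quoted in this article.

```shell
# cc_at_least HAVE NEED: succeeds when compute capability HAVE >= NEED.
# Both arguments must be in "major.minor" form, e.g. 3.5 or 6.0.
cc_at_least() {
    local have_major="${1%%.*}" have_minor="${1#*.}"
    local need_major="${2%%.*}" need_minor="${2#*.}"
    if [ "$have_major" -ne "$need_major" ]; then
        [ "$have_major" -gt "$need_major" ]
    else
        [ "$have_minor" -ge "$need_minor" ]
    fi
}

# GT 640 (3.5) against the DIGITS 20.12 requirement (6.0).
if cc_at_least 3.5 6.0; then
    echo "GPU meets the DIGITS 20.12 requirement"
else
    echo "GPU is too old for DIGITS 20.12"
fi
```

The same helper accepts the toolkit's lower bar: cc_at_least 3.5 2.1 succeeds, which matches the fact that nvidia-smi worked inside the container while DIGITS itself rejected the GPU.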