docker-compose can't find the NVIDIA driver
I'm trying to run the Clara Train example, but when I execute startClaraTrainNoteBooks.sh the container can't find the NVIDIA driver.
I know that the script runs docker-compose.yml, so I first tested whether docker-compose itself can find the NVIDIA driver:
services:
  test:
    image: nvidia/cuda:10.2-base
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ['0']
Output:
USER@test:~$ docker-compose up
WARNING: Found orphan containers (hp_nvsmi_1) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up.
Starting hp_test_1 ... done
Attaching to hp_test_1
test_1 | Mon Jun 7 09:01:44 2021
test_1 | +-----------------------------------------------------------------------------+
test_1 | | NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
test_1 | |-------------------------------+----------------------+----------------------+
test_1 | | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
test_1 | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
test_1 | | | | MIG M. |
test_1 | |===============================+======================+======================|
test_1 | | 0 GeForce RTX 206... Off | 00000000:01:00.0 Off | N/A |
test_1 | | 0% 34C P8 17W / 215W | 100MiB / 7979MiB | 0% Default |
test_1 | | | | N/A |
test_1 | +-------------------------------+----------------------+----------------------+
test_1 |
test_1 | +-----------------------------------------------------------------------------+
test_1 | | Processes: |
test_1 | | GPU GI CI PID Type Process name GPU Memory |
test_1 | | ID ID Usage |
test_1 | |=============================================================================|
test_1 | +-----------------------------------------------------------------------------+
hp_test_1 exited with code 0
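For reference, the same check can also be done without Compose, just to confirm that the NVIDIA Container Toolkit itself works (a minimal sanity check, assuming Docker 19.03+ with the toolkit installed; the image tag is the same one used above):

docker run --rm --gpus all nvidia/cuda:10.2-base nvidia-smi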
But the container started by startClaraTrainNoteBooks.sh can't find it:
root@claratrain:/claraDevDay# nvidia-smi
root@claratrain:/claraDevDay#
startDocker.sh, on the other hand, does find the driver:
root@c7c2d5597eb8:/claraDevDay# nvidia-smi
Mon Jun 7 09:11:43 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 206... Off | 00000000:01:00.0 Off | N/A |
| 0% 35C P8 17W / 215W | 100MiB / 7979MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@c7c2d5597eb8:/claraDevDay#
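Since the plain docker path works while the Compose path does not, it may also be worth recording which docker-compose version is in use and whether the nvidia runtime is registered, because GPU device reservations under deploy.resources are only honoured by docker-compose 1.28.0 and later (a quick diagnostic, nothing Clara-specific):

docker-compose version
docker info | grep -i -A 2 runtimes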
What should I do? How does the docker-compose.yml used by the script need to be rewritten so that it works? This is the file in question:
# SPDX-License-Identifier: Apache-2.0
version: "3.8"
services:
  claratrain:
    container_name: claradevday-pt
    hostname: claratrain
    ##### use vanilla clara train docker
    #image: nvcr.io/nvidia/clara-train-sdk:v4.0
    ##### to build image with GPU dashboard inside jupyter lab
    build:
      context: ./dockerWGPUDashboardPlugin/    # Project root
      dockerfile: ./Dockerfile                 # Relative to context
    image: clara-train-nvdashboard:v4.0
    depends_on:
      - tritonserver
    ports:
      - "3030:8888"  # Jupyter lab port
      - "3031:5000"  # AIAA port
    ipc: host
    volumes:
      - ${TRAIN_DEV_DAY_ROOT}:/claraDevDay/
      - /raid/users/aharouni/data:/data/
    command: "jupyter lab /claraDevDay --ip 0.0.0.0 --allow-root --no-browser --config /claraDevDay/scripts/jupyter_notebook_config.py"
    # command: tail -f /dev/null
    # tty: true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [ gpu ]
              # To specify certain GPU uncomment line below
              #device_ids: ['0,3']
  #############################################################
  tritonserver:
    image: nvcr.io/nvidia/tritonserver:21.02-py3
    container_name: aiaa-triton
    hostname: tritonserver
    restart: unless-stopped
    command: >
      sh -c "chmod 777 /triton_models &&
        /opt/tritonserver/bin/tritonserver \
          --model-store /triton_models \
          --model-control-mode="poll" \
          --repository-poll-secs=5 \
          --log-verbose ${TRITON_VERBOSE}"
    volumes:
      - ${TRAIN_DEV_DAY_ROOT}/AIAA/workspace/triton_models:/triton_models
    # shm_size: 1gb
    # ulimits:
    #   memlock: -1
    #   stack: 67108864
    # logging:
    #   driver: json-file
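For comparison, the reservation block in my working test file above differs from this one only in that it pins device_ids, while here device_ids is commented out. A sketch of how the claratrain reservation could be written with an explicit GPU count instead (count: all is from the Compose spec; whether this is the actual fix here is my assumption, not something confirmed):

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]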