Torque jobs cannot find GPU when CUDA_VISIBLE_DEVICES is not equal to 0

Question

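I am seeing a strange problem with Torque's GPU allocation.

I am running Torque 6.1.0 on a machine with two NVIDIA GTX Titan X GPUs, and I am using pbs_sched for scheduling. With nothing running, the nvidia-smi output is as follows: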

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:03:00.0      On |                  N/A |
| 22%   40C    P8    15W / 250W |      0MiB / 12204MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:04:00.0     Off |                  N/A |
| 22%   33C    P8    14W / 250W |      0MiB / 12207MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I have a simple test script, test.sh, to check the GPU allocation:

#PBS -S /bin/bash
#PBS -l nodes=1:ppn=1:gpus=1:reseterr:exclusive_process

echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery

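deviceQuery is a utility that ships with CUDA. When I run it from the command line, it correctly finds both GPUs. When I restrict it to one device from the command line like this...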

CUDA_VISIBLE_DEVICES=0 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
#or
CUDA_VISIBLE_DEVICES=1 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery

...it also correctly finds one GPU or the other.
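The manual check above can be wrapped in a small loop. This is only a sketch: `run_per_device` is a hypothetical helper name, and the GPU count is passed in by hand rather than detected.

```shell
# Run a command once per GPU index, exposing only that index through
# CUDA_VISIBLE_DEVICES. Usage: run_per_device <num_gpus> <command> [args...]
# (run_per_device is a hypothetical helper, not part of CUDA or Torque.)
run_per_device() (
    num_gpus=$1
    shift
    i=0
    while [ "$i" -lt "$num_gpus" ]; do
        echo "== CUDA_VISIBLE_DEVICES=$i =="
        # The assignment applies only to this one command invocation.
        CUDA_VISIBLE_DEVICES=$i "$@"
        i=$((i + 1))
    done
)
```

For the machine described here, it would be called as `run_per_device 2 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery`.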

When I submit test.sh to the queue with qsub while no other jobs are running, it again works correctly. Here is the output:

CUDA_VISIBLE_DEVICES: 0 
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX TITAN X"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 12204 MBytes (12796887040 bytes)
  (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores
  GPU Max Clock rate:                            1076 MHz (1.08 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
  Compute Mode:
     < Exclusive Process (many threads in one process is able to use ::cudaSetDevice() with this device) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX TITAN X
Result = PASS

However, if a job is already running on gpu0 (i.e. if the new job is assigned CUDA_VISIBLE_DEVICES=1), the new job cannot find any GPUs. The output:

CUDA_VISIBLE_DEVICES: 1
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL
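To narrow down a failure like this, the job script could print more about what the job was actually given. The following is a minimal sketch, not a definitive diagnostic: `gpu_job_report` is a hypothetical helper name, and it assumes Torque exports `PBS_GPUFILE` (a file listing the GPUs allocated to the job) and that `nvidia-smi` may or may not be on the job's PATH.

```shell
# Print what the current job can see about its GPU allocation.
# (gpu_job_report is a hypothetical helper, not part of Torque or CUDA.)
gpu_job_report() {
    echo "CUDA_VISIBLE_DEVICES: ${CUDA_VISIBLE_DEVICES:-<unset>}"

    # Assumption: Torque sets PBS_GPUFILE to a file naming the allocated GPUs.
    if [ -n "${PBS_GPUFILE:-}" ] && [ -r "$PBS_GPUFILE" ]; then
        echo "PBS_GPUFILE ($PBS_GPUFILE):"
        cat "$PBS_GPUFILE"
    else
        echo "PBS_GPUFILE: <unset or unreadable>"
    fi

    # List the GPUs the driver reports from inside the job, if the tool exists.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L || echo "nvidia-smi -L failed"
    else
        echo "nvidia-smi: not found in PATH"
    fi
}
```

Calling `gpu_job_report` just before deviceQuery in test.sh would show whether the GPU index Torque assigned matches a GPU that is visible from inside the job.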

Does anyone know what is going on here?

Answer 1

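I think I have solved my own problem, but unfortunately I tried two things at once, and I don't want to go back and confirm which one fixed it. It was one of the following: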

Removing the --enable-cgroups option from Torque's configure script before building.
Running these steps from the Torque installation procedure:

make packages

sh torque-package-server-linux-x86_64.sh --install

sh torque-package-mom-linux-x86_64.sh --install

sh torque-package-clients-linux-x86_64.sh --install

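Regarding the second option, I know these steps are properly documented in the Torque installation instructions. However, I have a simple setup with only one node (the compute node and the server are the same machine). I assumed 'make install' would do everything the package installs do for that single node, but maybe I was wrong.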
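For anyone trying to tell the two fixes apart, one way to see whether an existing build tree was configured with cgroups is to look at the configure invocation recorded in autoconf's config.log. This is a sketch under assumptions: `configured_with_cgroups` is a hypothetical helper name, and the path to config.log depends on where Torque was built.

```shell
# Succeed if the given autoconf config.log records --enable-cgroups on the
# configure command line. Usage: configured_with_cgroups <path/to/config.log>
# (configured_with_cgroups is a hypothetical helper, not part of Torque.)
configured_with_cgroups() {
    # config.log records the invocation on a line of the form "  $ ./configure ...".
    grep -E '^[[:space:]]*\$ .*configure' "$1" | grep -q -- '--enable-cgroups'
}
```

It could be used as `configured_with_cgroups config.log && echo "built with cgroups"` from the top of the build directory.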

