TensorFlow: Problems with TORQUE and GPUs when starting a new session: CUDA_ERROR_INVALID_DEVICE
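I am trying to solve a problem that occurs on a cluster when using TensorFlow v1.0.1 with GPUs, together with TORQUE v6.1.0 and MOAB as the job scheduler.
The error occurs when the executed Python script tries to start a new session: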
[...]
with tf.Session() as sess:
[...]
Error message:
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
E tensorflow/core/common_runtime/direct_session.cc:137] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE
Load Data...
input: (12956, 128, 128, 1)
output: (12956, 64, 64, 16)
Initiliaze training
Traceback (most recent call last):
File "[...]/train.py", line 154, in <module>
tf.app.run()
File "[...]/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "[...]/train.py", line 150, in main
training()
File "[...]/train.py", line 72, in training
with tf.Session() as sess:
File "[...]/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1176, in __init__
super(Session, self).__init__(target, graph, config=config)
File "[...]/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 552, in __init__
self._session = tf_session.TF_NewDeprecatedSession(opts, status)
File "[...]/python/3.5.1/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "[...]/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
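To reproduce the problem, I ran the script directly on an offline GPU node (so TORQUE was not involved), and no error was thrown. I therefore assume the problem is related to TORQUE, but I have not found a solution yet.
TORQUE parameters: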
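Since the script works on the node directly but fails under TORQUE, one way to narrow this down is to compare the GPU-related environment in both situations. The following is only a minimal sketch: `gpu_env_report` and `GPU_ENV_VARS` are hypothetical names introduced here, and the assumption that TORQUE exports `PBS_GPUFILE` and `PBS_JOBID` inside a job should be checked against your own installation.

```python
import os

# Variables assumed to matter here. CUDA_VISIBLE_DEVICES is read by the CUDA
# runtime; PBS_GPUFILE and PBS_JOBID are assumed to be set by TORQUE in a job.
GPU_ENV_VARS = ("CUDA_VISIBLE_DEVICES", "PBS_GPUFILE", "PBS_JOBID")


def gpu_env_report(env=None):
    """Collect GPU-related environment settings into a dict."""
    env = os.environ if env is None else env
    report = {name: env.get(name) for name in GPU_ENV_VARS}

    # If the scheduler wrote a GPU file, list the entries it contains.
    gpufile = report["PBS_GPUFILE"]
    if gpufile and os.path.isfile(gpufile):
        with open(gpufile) as fh:
            report["assigned_gpus"] = [line.strip() for line in fh if line.strip()]
    else:
        report["assigned_gpus"] = None
    return report


if __name__ == "__main__":
    # Run this once inside the TORQUE job and once directly on the node,
    # then compare the two outputs.
    for key, value in gpu_env_report().items():
        print("{}: {}".format(key, value))
```

If the two reports differ (for example, a device list that is restricted or reordered inside the job), that difference is a reasonable first place to look.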
#PBS -l nodes=1:ppn=2:gpus=4:exclusive_process
#PBS -l mem=25gb
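I tried once without exclusive_process, but the job was not executed. I think our scheduler requires this flag when GPUs are involved.
I think I found a way to get the job running by changing the compute mode from 'exclusive_process' to 'shared'.
Now the job starts, and it seems to be computing something. But looking at the nvidia-smi output, I wonder whether all four GPUs are actually being used. Why are all GPUs working on the same process?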
Fri May 26 13:41:33 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26 Driver Version: 375.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 0000:04:00.0 Off | 0 |
| N/A 45C P0 58W / 149W | 10871MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 0000:05:00.0 Off | 0 |
| N/A 37C P0 70W / 149W | 10873MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 On | 0000:84:00.0 Off | 0 |
| N/A 32C P0 59W / 149W | 10871MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 On | 0000:85:00.0 Off | 0 |
| N/A 58C P0 143W / 149W | 11000MiB / 11439MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 11757 C python 10867MiB |
| 1 11757 C python 10869MiB |
| 2 11757 C python 10867MiB |
| 3 11757 C python 10996MiB |
+-----------------------------------------------------------------------------+
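The listing above shows memory allocated on all four GPUs but non-zero utilization only on GPU 3. That would be consistent with TensorFlow 1.x's default behaviour of reserving most of the memory on every visible GPU, even when the graph's operations run on a single device. The sketch below separates the two numbers from the status rows; `summarize` and `idle_but_allocated` are hypothetical helper names, and the code assumes the rows appear in GPU index order.

```python
import re

# Matches the "used / total | util%" part of an nvidia-smi status row.
STATUS_ROW = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB\s*\|\s*(\d+)%")


def summarize(smi_text):
    """Return one dict per GPU status row, assuming rows are in index order."""
    rows = []
    for index, match in enumerate(STATUS_ROW.finditer(smi_text)):
        used, total, util = (int(group) for group in match.groups())
        rows.append({"gpu": index, "used_mib": used,
                     "total_mib": total, "util_pct": util})
    return rows


def idle_but_allocated(rows, mem_fraction=0.5):
    """GPUs holding a lot of memory while showing 0% utilization."""
    return [r["gpu"] for r in rows
            if r["util_pct"] == 0 and r["used_mib"] > mem_fraction * r["total_mib"]]


# Status rows copied from the nvidia-smi output above.
SAMPLE = """\
| N/A 45C P0 58W / 149W | 10871MiB / 11439MiB | 0% Default |
| N/A 37C P0 70W / 149W | 10873MiB / 11439MiB | 0% Default |
| N/A 32C P0 59W / 149W | 10871MiB / 11439MiB | 0% Default |
| N/A 58C P0 143W / 149W | 11000MiB / 11439MiB | 95% Default |
"""

if __name__ == "__main__":
    print(idle_but_allocated(summarize(SAMPLE)))
```

On the rows above, GPUs 0, 1 and 2 are reported as allocated but idle, which suggests that only one GPU is doing the computation. In TensorFlow 1.x, spreading work across several GPUs generally requires placing operations explicitly, for example with `tf.device('/gpu:1')`.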