Unable to use GPU to train a NN model in Azure Machine Learning service using P100-NC6s-V2 compute. Fails with CUDA error
I recently started using Azure for ML and have been trying out the Machine Learning service workspace.
I set up a workspace and configured the compute cluster with NC6s-V2 machines, because I need to train a neural network on images using a GPU.
The problem is that training still runs on the CPU: the logs show that it cannot find CUDA. Below is the warning log produced when I run my script.
Any clue how to fix this?
I have also explicitly listed the tensorflow-gpu package in the estimator's conda packages option.
Here is my estimator code:
script_params = {
    '--input_data_folder': ds.path('dataset').as_mount(),
    '--zip_file_name': 'train.zip',
    '--run_mode': 'train'
}

est = Estimator(source_directory='./scripts',
                script_params=script_params,
                compute_target=compute_target,
                entry_script='main.py',
                conda_packages=['scikit-image', 'keras', 'tqdm', 'pillow',
                                'matplotlib', 'scipy', 'tensorflow-gpu'])

run = exp.submit(config=est)
run.wait_for_completion(show_output=True)
The compute target was created following the sample code on GitHub:
compute_name = "P100-NC6s-V2"
compute_min_nodes = 0
compute_max_nodes = 4
vm_size = "STANDARD_NC6S_V2"

if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('found compute target. just use it. ' + compute_name)
else:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_size,
                                                                min_nodes=compute_min_nodes,
                                                                max_nodes=compute_max_nodes)
    # create the cluster
    compute_target = ComputeTarget.create(
        ws, compute_name, provisioning_config)

    # can poll for a minimum number of nodes and for a specific timeout.
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(
        show_output=True, min_node_count=None, timeout_in_minutes=20)

    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())
This is the warning about the GPU failing to be used:
2019-08-12 14:50:16.961247: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55a7ce570830 executing computations on platform Host. Devices:
2019-08-12 14:50:16.961278: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2019-08-12 14:50:16.971025: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/lib:/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/mic/lib:/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/lib:/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/mic/lib:/azureml-envs/azureml_5fdf05c5671519f307e0f43128b8610e/lib:
2019-08-12 14:50:16.971054: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2019-08-12 14:50:16.971081: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: 4bd815dfb0e74e3da901861a4746184f000000
2019-08-12 14:50:16.971089: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: 4bd815dfb0e74e3da901861a4746184f000000
2019-08-12 14:50:16.971164: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2019-08-12 14:50:16.971202: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 418.40.4
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
2019-08-12 14:50:16.973301: I tensorflow/core/common_runtime/direct_session.cc:296] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
According to the logs, it is currently running on the CPU. Any clue what could fix this?
Line 3 of the error suggests that you haven't installed all the CUDA libraries required to run a neural network on the GPU. Make sure all CUDA dependencies are installed. If you are unsure, refer to this Stack Overflow question:
Welcome to SO!
Instead of the base Estimator, you can use the TensorFlow estimator, with Keras and other libraries layered on top. That way you don't have to worry about setting up and configuring the GPU libraries, because the TensorFlow estimator uses a Docker image that comes pre-configured with them.
See the documentation here:
API Reference. You can specify extra libraries with the conda_packages
parameter. Also set the parameter use_gpu = True
.
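A minimal sketch of what that could look like, reusing the script_params and compute_target from the question (all parameter values other than use_gpu=True are carried over from the original code, so treat this as an illustration rather than a drop-in fix):

```python
from azureml.train.dnn import TensorFlow

# TensorFlow estimator: the GPU base image already ships CUDA/cuDNN,
# so tensorflow-gpu no longer needs to appear in conda_packages.
est = TensorFlow(source_directory='./scripts',
                 script_params=script_params,
                 compute_target=compute_target,
                 entry_script='main.py',
                 conda_packages=['scikit-image', 'keras', 'tqdm',
                                 'pillow', 'matplotlib', 'scipy'],
                 use_gpu=True)

run = exp.submit(config=est)
run.wait_for_completion(show_output=True)
```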
I had the same problem before, and I solved it with:
env = Environment.from_pip_requirements(
    name="hello",
    file_path=f'projects/requirements.txt'
)
env.docker.enabled = True
env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04'
You have to specify the right Docker image.
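For completeness, here is a sketch of how such an environment might be attached to a run via ScriptRunConfig (available in newer versions of the azureml SDK; the script name and experiment object are assumptions carried over from the question, not part of the answer above):

```python
from azureml.core import Environment, ScriptRunConfig

# Build the environment from a pip requirements file and point it at a
# CUDA-enabled base image, as in the answer above.
env = Environment.from_pip_requirements(
    name="hello",
    file_path='projects/requirements.txt'
)
env.docker.enabled = True
env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04'

# Attach the CUDA-enabled environment to the run configuration so the
# training container actually has the GPU libraries available.
src = ScriptRunConfig(source_directory='./scripts',
                      script='main.py',
                      compute_target=compute_target,
                      environment=env)
run = exp.submit(src)
```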