Failed to get convolution algorithm error ~ tensorflow-gpu on ubuntu 20.04

Question

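I have an NVIDIA RTX 2070 GPU, and my OS is Ubuntu 20.04.

I installed the tensorflow-gpu package with conda. I did not install the CUDA toolkit separately; I believe conda also installs the CUDA toolkit libraries needed for GPU acceleration, because conda install tensorflow-gpu listed the following packages to be installed: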

Collecting package metadata (current_repodata.json): done
Solving environment: done


## Package Plan ##

  environment location: /home/psychotechnopath/anaconda3/envs/DeepLearning3.6

  added / updated specs:
    - tensorflow-gpu


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _tflow_select-2.1.0        |              gpu           2 KB
    absl-py-0.9.0              |           py36_0         167 KB
    asn1crypto-1.3.0           |           py36_0         164 KB
    astor-0.8.0                |           py36_0          46 KB
    blinker-1.4                |           py36_0          22 KB
    c-ares-1.15.0              |    h7b6447c_1001          89 KB
    cachetools-3.1.1           |             py_0          14 KB
    cffi-1.14.0                |   py36h2e261b9_0         223 KB
    chardet-3.0.4              |        py36_1003         180 KB
    click-7.1.1                |             py_0          71 KB
    cryptography-2.8           |   py36h1ba5d50_0         552 KB
    cudatoolkit-10.1.243       |       h6bb024c_0       347.4 MB
    cudnn-7.6.5                |       cuda10.1_0       179.9 MB
    cupti-10.1.168             |                0         1.4 MB
    gast-0.2.2                 |           py36_0         155 KB
    google-auth-1.13.1         |             py_0          57 KB
    google-auth-oauthlib-0.4.1 |             py_2          20 KB
    google-pasta-0.2.0         |             py_0          44 KB
    grpcio-1.27.2              |   py36hf8bcb03_0         1.3 MB
    h5py-2.10.0                |   py36h7918eee_0         1.0 MB
    idna-2.9                   |             py_1          49 KB
    keras-applications-1.0.8   |             py_0          33 KB
    keras-preprocessing-1.1.0  |             py_1          36 KB
    libprotobuf-3.11.4         |       hd408876_0         2.9 MB
    markdown-3.1.1             |           py36_0         116 KB
    mkl-service-2.3.0          |   py36he904b0f_0         219 KB
    mkl_fft-1.0.15             |   py36ha843d7b_0         155 KB
    mkl_random-1.1.0           |   py36hd6b4f25_0         324 KB
    numpy-1.18.1               |   py36h4f9e942_0           5 KB
    numpy-base-1.18.1          |   py36hde5b4d6_1         4.2 MB
    oauthlib-3.1.0             |             py_0          88 KB
    opt_einsum-3.1.0           |             py_0          54 KB
    protobuf-3.11.4            |   py36he6710b0_0         635 KB
    pyasn1-0.4.8               |             py_0          58 KB
    pyasn1-modules-0.2.7       |             py_0          63 KB
    pycparser-2.20             |             py_0          92 KB
    pyjwt-1.7.1                |           py36_0          33 KB
    pyopenssl-19.1.0           |           py36_0          87 KB
    pysocks-1.7.1              |           py36_0          30 KB
    requests-2.23.0            |           py36_0          91 KB
    requests-oauthlib-1.3.0    |             py_0          22 KB
    rsa-4.0                    |             py_0          29 KB
    scipy-1.4.1                |   py36h0b6359f_0        14.6 MB
    six-1.14.0                 |           py36_0          27 KB
    tensorboard-2.1.0          |            py3_0         3.3 MB
    tensorflow-2.1.0           |gpu_py36h2e5cdaa_0           4 KB
    tensorflow-base-2.1.0      |gpu_py36h6c5654b_0       155.9 MB
    tensorflow-estimator-2.1.0 |     pyhd54b08b_0         251 KB
    tensorflow-gpu-2.1.0       |       h0d30ee6_0           3 KB
    termcolor-1.1.0            |           py36_1           8 KB
    urllib3-1.25.8             |           py36_0         169 KB
    werkzeug-1.0.1             |             py_0         240 KB
    wrapt-1.12.1               |   py36h7b6447c_1          49 KB
    ------------------------------------------------------------
                                           Total:       716.6 MB

When I check whether my GPU is detected, using:

import tensorflow as tf
print(tf.__version__)
print("Num GPUs Available: ", tf.config.experimental.list_physical_devices('GPU'))

it detects my GPU, but it seems to report some NUMA errors (which I am not familiar with):

2020-05-01 11:39:26.778829: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-01 11:39:26.799789: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-01 11:39:26.800132: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:08:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.62GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-05-01 11:39:26.800234: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-01 11:39:26.801035: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-01 11:39:26.801981: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-05-01 11:39:26.802098: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-05-01 11:39:26.802926: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-05-01 11:39:26.803409: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-05-01 11:39:26.805224: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-01 11:39:26.805297: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-01 11:39:26.805669: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-01 11:39:26.805974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0

This is the output of the print statement:

Num GPUs Available:  [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

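Subsequently, when I try to run a convolutional neural network, I get the following output/error. (I decided to include the full output because I don't know which parts are relevant and which are not; TensorFlow experts, please feel free to edit out the irrelevant parts.)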

2020-05-01 11:41:53.682279: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-01 11:41:53.703168: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-01 11:41:53.703512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:08:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.62GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-05-01 11:41:53.703618: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-01 11:41:53.704375: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-01 11:41:53.705278: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-05-01 11:41:53.705394: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-05-01 11:41:53.706237: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-05-01 11:41:53.706725: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-05-01 11:41:53.708557: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-01 11:41:53.708630: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-01 11:41:53.708994: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-01 11:41:53.709299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-05-01 11:41:53.709511: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-05-01 11:41:53.733654: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3792915000 Hz
2020-05-01 11:41:53.734418: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55ad4b26e7d0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-01 11:41:53.734434: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-05-01 11:41:53.734576: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-01 11:41:53.735123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:08:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.62GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-05-01 11:41:53.735146: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-01 11:41:53.735157: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-01 11:41:53.735167: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-05-01 11:41:53.735176: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-05-01 11:41:53.735186: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-05-01 11:41:53.735195: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-05-01 11:41:53.735204: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-01 11:41:53.735259: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-01 11:41:53.735820: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-01 11:41:53.736333: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-05-01 11:41:53.736360: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-01 11:41:54.012838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-01 11:41:54.012856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2020-05-01 11:41:54.012861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N 
2020-05-01 11:41:54.012980: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-01 11:41:54.013316: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-01 11:41:54.013643: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-01 11:41:54.013951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7011 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:08:00.0, compute capability: 7.5)
2020-05-01 11:41:54.015048: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55ad4ef1fe00 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-05-01 11:41:54.015055: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2020-05-01 11:41:54.619977: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-01 11:41:54.765976: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-01 11:41:55.109936: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-05-01 11:41:55.123585: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-05-01 11:41:55.123654: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node sequential/conv2d/Conv2D}}]]
Traceback (most recent call last):
  File "/home/psychotechnopath/MEGA/Machine Learning/11. Deep learning for Python/5. Convolutional neural networks/CH19_Digits.py", line 66, in <module>
    model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=200, epochs=10, verbose=2)
  File "/home/psychotechnopath/anaconda3/envs/DeepLearning3.6/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/psychotechnopath/anaconda3/envs/DeepLearning3.6/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit
    total_epochs=epochs)
  File "/home/psychotechnopath/anaconda3/envs/DeepLearning3.6/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/home/psychotechnopath/anaconda3/envs/DeepLearning3.6/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function
    distributed_function(input_fn))
  File "/home/psychotechnopath/anaconda3/envs/DeepLearning3.6/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "/home/psychotechnopath/anaconda3/envs/DeepLearning3.6/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 632, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/psychotechnopath/anaconda3/envs/DeepLearning3.6/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2363, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/psychotechnopath/anaconda3/envs/DeepLearning3.6/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "/home/psychotechnopath/anaconda3/envs/DeepLearning3.6/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/psychotechnopath/anaconda3/envs/DeepLearning3.6/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/home/psychotechnopath/anaconda3/envs/DeepLearning3.6/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node sequential/conv2d/Conv2D (defined at /MEGA/Machine Learning/11. Deep learning for Python/5. Convolutional neural networks/CH19_Digits.py:66) ]] [Op:__inference_distributed_function_1027]

Function call stack:
distributed_function

Answer 1

This appears to be a known bug in TensorFlow, related to how TensorFlow allocates memory on RTX 20XX cards. See the detailed thread here:

https://github.com/tensorflow/tensorflow/issues/24496

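What solved the problem for me was adding the following line at the top of my script (right after import tensorflow as tf, and before any model is built):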

tf.config.experimental.set_memory_growth(tf.config.list_physical_devices('GPU')[0], True)
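
The one-liner above assumes that at least one GPU is visible and configures only the first one. A slightly more defensive variant might look like the sketch below. The helper name enable_memory_growth is hypothetical (it is not part of TensorFlow), and it takes the tf.config module as an argument purely so that the helper itself does not import TensorFlow; the two TensorFlow calls are the same ones used in the line above.

```python
def enable_memory_growth(config):
    """Request on-demand GPU memory allocation for every visible GPU.

    `config` is assumed to be the `tf.config` module. Returns the number
    of GPUs for which memory growth was enabled.
    """
    configured = 0
    # An empty list is returned on a machine without a visible GPU,
    # so the loop simply does nothing there instead of raising IndexError.
    for gpu in config.list_physical_devices('GPU'):
        try:
            config.experimental.set_memory_growth(gpu, True)
            configured += 1
        except RuntimeError as err:
            # TensorFlow raises RuntimeError when the GPU has already been
            # initialized; memory growth has to be set before that happens.
            print(f"Could not set memory growth for {gpu}: {err}")
    return configured
```

To use it, call enable_memory_growth(tf.config) immediately after import tensorflow as tf, before creating any tensors or models. With memory growth enabled, TensorFlow allocates GPU memory as it is needed rather than reserving almost all of it at startup.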