Training DeepSpeech on the Common Voice dataset gives an error on GPU

I am trying to train a DeepSpeech model on the Common Voice dataset as described in the documentation, but it fails with the following error:

I0421 11:34:32.779112 140581195995008 utils.py:157] NumExpr defaulting to 2 threads.
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1348, in _run_fn
    self._extend_graph()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1388, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' used by {{node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams}}with these attrs: [dropout=0, seed=4568, num_params=8, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm", seed2=247]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

     [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/DeepSpeech/DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/content/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/content/DeepSpeech/training/deepspeech_training/train.py", line 954, in main
    train()
  File "/content/DeepSpeech/training/deepspeech_training/train.py", line 529, in train
    load_or_init_graph_for_training(session)
  File "/content/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 137, in load_or_init_graph_for_training
    _load_or_init_impl(session, methods, allow_drop_layers=True)
  File "/content/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 112, in _load_or_init_impl
    return _initialize_all_variables(session)
  File "/content/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 88, in _initialize_all_variables
    session.run(v.initializer)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' used by node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [dropout=0, seed=4568, num_params=8, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm", seed2=247]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

     [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]

My local machine specs are as follows:

Python 3.7; CUDA 10.1; cuDNN 7.6.5; tensorflow-gpu 1.15.2; GPU GTX 1050 Ti

I also installed the following packages and libraries to prepare the environment:

!apt-add-repository universe
!apt-get install sox libsox-fmt-mp3 cmake libblkid-dev e2fslibs-dev libboost-all-dev libaudit-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev
!python3.7 -m pip install sox
!python3.7 -m pip install deepspeech-gpu
!python3.7 -m pip install tensorflow-gpu==1.15.2
!python3.7 -m pip install numpy==1.19.5
!python3.7 -m pip install progressbar2
!python3.7 -m pip install progressbar
!python3.7 -m pip install progressbar33
!python3.7 -m pip install ds_ctcdecoder==0.10.0-alpha.3
!python3.7 -m pip install pyogg==0.6.14a1
!python3.7 -m pip install deepspeech
!git clone --branch v0.9.3 https://github.com/mozilla/DeepSpeech
!python3.7 -m pip install --upgrade --force-reinstall -e ./DeepSpeech/
!git clone https://github.com/kpu/kenlm.git
!mkdir -p build
!cmake kenlm
!make -j 4
!wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-checkpoint.tar.gz
!curl -LO "https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/native_client.amd64.cuda.linux.tar.xz"
!mkdir native_client
!tar xvf native_client.amd64.cuda.linux.tar.xz -C native_client

I get the same error on both my local machine and a Google Colab VM.

Edit: I also changed my CUDA and cuDNN versions to 10.0 and 7.5.6 respectively, but the error persists.

I have seen a similar error posted on the DeepSpeech Discourse, and the problem there was the CUDA installation.

What is the value of your $LD_LIBRARY_PATH environment variable?

You can find it with:

$ echo $LD_LIBRARY_PATH
/usr/lib/x86_64-linux-gnu:/usr/local/cuda/bin:/usr/local/cuda/lib64:/usr/local/cuda-11.2/targets/x86_64-linux/lib

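My suspicion is that CUDA cannot find the correct libraries.

Thanks for the additional information, Soroush.

LD_LIBRARY_PATH looks fine, assuming the libraries are actually present in those paths.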
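If you want to confirm that assumption rather than take it on faith, something like the following could help. It is only a rough sketch: the function name find_cuda_libs and the default file patterns are my own choices for illustration, not anything DeepSpeech or TensorFlow provides, and it only checks that matching files exist, not that they are the versions TensorFlow was built against.

```python
import glob
import os


def find_cuda_libs(ld_library_path,
                   patterns=("libcudart.so*", "libcudnn.so*", "libcublas.so*")):
    """Map each filename pattern to the matching files found in the path list.

    `patterns` is an illustrative default, not an official list of what
    TensorFlow needs.
    """
    found = {pattern: [] for pattern in patterns}
    for directory in ld_library_path.split(os.pathsep):
        # Skip empty entries and directories that do not exist.
        if not directory or not os.path.isdir(directory):
            continue
        for pattern in patterns:
            matches = sorted(glob.glob(os.path.join(directory, pattern)))
            found[pattern].extend(matches)
    return found


if __name__ == "__main__":
    results = find_cuda_libs(os.environ.get("LD_LIBRARY_PATH", ""))
    for pattern, paths in results.items():
        print(pattern, "->", paths if paths else "NOT FOUND")
```

Any pattern reported as NOT FOUND points at a directory in LD_LIBRARY_PATH that does not actually contain the library you expected.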

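Next, I would make sure the code is actually executing on the GPU.

There are many reasons why code might not execute on the GPU. You mentioned that your environment is set up according to the DeepSpeech PlayBook, which means it uses Docker. Is that right? If so, did you start the Docker container with the --gpus all argument?

The next thing to check is whether nvtop reports GPU activity from DeepSpeech. While the DeepSpeech.py script is running, it should cause a high compute load that is visible in nvtop. If you don't see this, the code is probably not executing on the GPU, which would explain the No OpKernel error.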
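The traceback in the question already hints at this: the "Registered devices" line lists only CPU and XLA_CPU. As a small sketch, you could pull that line out of a saved log and check it. The helper names below are made up for illustration, and I am assuming the log keeps the same "Registered devices: [...]" format shown in the question.

```python
import re


def registered_devices(log_text):
    """Return the device types from the first 'Registered devices: [...]' line."""
    match = re.search(r"Registered devices:\s*\[([^\]]*)\]", log_text)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def gpu_was_registered(log_text):
    """True if TensorFlow listed a GPU (or XLA_GPU) device in the log."""
    return any(name in ("GPU", "XLA_GPU") for name in registered_devices(log_text))


# Excerpt copied from the error message in the question.
sample = (
    "Registered devices: [CPU, XLA_CPU]\n"
    "Registered kernels:\n"
    "  <no registered kernels>"
)
print(registered_devices(sample))  # ['CPU', 'XLA_CPU']
print(gpu_was_registered(sample))  # False
```

If no GPU device shows up there, TensorFlow never saw the card at all, whatever nvtop says about the hardware. In TensorFlow 1.x you can also ask directly with tf.test.is_gpu_available().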

I have solved the problem. It was caused by the TensorFlow version. As I mentioned earlier, I was using TF 1.15.2, whereas I needed to use TF 1.15.4.
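For anyone hitting the same thing, here is a minimal sketch of a version check you could run before training. The required version 1.15.4 comes from my fix above; the helper names are hypothetical, and the parsing assumes an ordinary dotted version string.

```python
import re


def parse_version(text):
    """Turn a string like '1.15.4' into (1, 15, 4), using the first three numbers."""
    return tuple(int(part) for part in re.findall(r"\d+", text)[:3])


def matches_required(installed, required="1.15.4"):
    """True if the installed version equals the required one."""
    return parse_version(installed) == parse_version(required)


if __name__ == "__main__":
    try:
        import tensorflow as tf
        installed = tf.__version__
    except ImportError:
        installed = None

    if installed is None:
        print("TensorFlow is not installed in this environment.")
    elif matches_required(installed):
        print("TensorFlow", installed, "matches the required version.")
    else:
        print("TensorFlow", installed, "found, but 1.15.4 is required.")
```

In my setup the change amounted to installing tensorflow-gpu==1.15.4 in place of the tensorflow-gpu==1.15.2 line in the install commands from the question.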