在 Common Voice 数据集上训练 DeepSpeech 在 gpu 上出错

Question

我正在尝试在 documentation 中所述的 Common Voice 数据集上训练 DeepSpeech 模型。但它给出了以下错误：

I0421 11:34:32.779112 140581195995008 utils.py:157] NumExpr defaulting to 2 threads.
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1348, in _run_fn
    self._extend_graph()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1388, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' used by {{node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams}}with these attrs: [dropout=0, seed=4568, num_params=8, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm", seed2=247]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

     [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/DeepSpeech/DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/content/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/content/DeepSpeech/training/deepspeech_training/train.py", line 954, in main
    train()
  File "/content/DeepSpeech/training/deepspeech_training/train.py", line 529, in train
    load_or_init_graph_for_training(session)
  File "/content/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 137, in load_or_init_graph_for_training
    _load_or_init_impl(session, methods, allow_drop_layers=True)
  File "/content/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 112, in _load_or_init_impl
    return _initialize_all_variables(session)
  File "/content/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 88, in _initialize_all_variables
    session.run(v.initializer)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' used by node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [dropout=0, seed=4568, num_params=8, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm", seed2=247]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

     [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]

我的本地机器规格如下：

python 3.7; Cuda 10.1; CuDNN 7.6.5; tensorflow-gpu 1.15.2; GPU GTX 1050 ti

我还安装了以下包和库来准备环境：

!apt-add-repository universe
!apt-get install sox libsox-fmt-mp3 cmake libblkid-dev e2fslibs-dev libboost-all-dev libaudit-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev
!python3.7 -m pip install sox
!python3.7 -m pip install deepspeech-gpu
!python3.7 -m pip install tensorflow-gpu==1.15.2
!python3.7 -m pip install numpy==1.19.5
!python3.7 -m pip install progressbar2
!python3.7 -m pip install progressbar
!python3.7 -m pip install progressbar33
!python3.7 -m pip install ds_ctcdecoder==0.10.0-alpha.3
!python3.7 -m pip install pyogg==0.6.14a1
!python3.7 -m pip install deepspeech
!git clone --branch v0.9.3 https://github.com/mozilla/DeepSpeech
!python3.7 -m pip install --upgrade --force-reinstall -e ./DeepSpeech/
!git clone https://github.com/kpu/kenlm.git
!mkdir -p build
!cmake kenlm
!make -j 4
!wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-checkpoint.tar.gz
!curl -LO "https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/native_client.amd64.cuda.linux.tar.xz"
!mkdir native_client
!tar xvf native_client.amd64.cuda.linux.tar.xz -C native_client

我在本地计算机和 google colab vm 上都遇到了同样的问题。

编辑：我还将我的 cuda 和 cudnn 版本分别更改为 10.0 和 7.5.6。但是错误已经存在。

Answer 1

我看过 similar error posted on the DeepSpeech Discourse，问题是 CUDA 安装。

您的 $LD_LIBRARY_PATH 环境变量的值是多少？

您可以通过以下方式找到它：

$ echo $LD_LIBRARY_PATH
/usr/lib/x86_64-linux-gnu:/usr/local/cuda/bin:/usr/local/cuda/lib64:/usr/local/cuda-11.2/targets/x86_64-linux/lib

我的怀疑 CUDA 无法找到正确的库。

Answer 2

感谢您提供更多信息 Soroush。

LD_LIBRARY_PATH 看起来不错，我假设库 实际上 在这些路径中。

接下来，我要确保代码在 GPU 本身上执行。

代码可能无法在 GPU 上执行的原因有很多。您提到您的环境是根据 DeepSpeech PlayBook 设置的，这意味着它使用 Docker。那是对的吗？如果是这样，您是使用 gpus -all 参数生成的 Docker 容器吗？

接下来要检查的是 nvtop 是否正在从 DeepSpeech 报告 GPU activity。当 DeepSpeech.py 脚本是运行时，这应该会导致高 compute 负载，在 nvtop 中可以观察到。如果你没有看到这个，这意味着代码可能没有在 GPU 上执行，这可以解释 No OpKernel 错误。

Answer 3

我已经解决了这个问题。问题是由 Tensorflow 的版本引起的。作为我之前提到过，我使用的是 Tf 1.15.2，而我不得不使用 Tf 1.15.4。

在 Common Voice 数据集上训练 DeepSpeech 在 gpu 上出错

Train DeepSpeech on Common Voice dataset gives error on gpu

python

speech-recognition

deep-learning

tensorflow

mozilla-deepspeech