Error during training in deepspeech Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]
I get the following error when trying to execute:
%cd /content/DeepSpeech
!python3 DeepSpeech.py --train_cudnn True --early_stop True --es_epochs 6 --n_hidden 2048 --epochs 20 \
--export_dir /content/models/ --checkpoint_dir /content/model_checkpoints/ \
--train_files /content/train.csv --dev_files /content/dev.csv --test_files /content/test.csv \
--learning_rate 0.0001 --train_batch_size 64 --test_batch_size 32 --dev_batch_size 32 --export_file_name 'ft_model' \
--augment reverb[p=0.2,delay=50.0~30.0,decay=10.0:2.0~1.0] \
--augment volume[p=0.2,dbfs=-10:-40] \
--augment pitch[p=0.2,pitch=1~0.2] \
--augment tempo[p=0.2,factor=1~0.5]
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config:
    [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0,
    [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 798, 64, 2048]
    [[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]]
    [[tower_0/gradients/tower_0/BiasAdd_2_grad/BiasAddGrad/_87]]
(1) Internal: Failed to call ThenRnnForward with model config:
    [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0,
    [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 798, 64, 2048]
    [[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]]
0 successful operations. 0 derived errors ignored.
If I run it as below, with the augmentation flags commented out, it works fine.
%cd /content/DeepSpeech
!python3 DeepSpeech.py --train_cudnn True --early_stop True --es_epochs 6 --n_hidden 2048 --epochs 20 \
--export_dir /content/models/ --checkpoint_dir /content/model_checkpoints/ \
--train_files /content/train.csv --dev_files /content/dev.csv --test_files /content/test.csv \
--learning_rate 0.0001 --train_batch_size 64 --test_batch_size 32 --dev_batch_size 32 --export_file_name 'ft_model' \
# --augment reverb[p=0.2,delay=50.0~30.0,decay=10.0:2.0~1.0] \
# --augment volume[p=0.2,dbfs=-10:-40] \
# --augment pitch[p=0.2,pitch=1~0.2] \
# --augment tempo[p=0.2,factor=1~0.5]
Basically, the --augment flags are doing something that breaks the training.
Best guess here is that TensorFlow is running out of memory. The dev, test, and train batch sizes are quite large in both cases, and the augmentations require additional memory. Try lowering the batch sizes and see whether training starts; if it does, increase them gradually.
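For example, one way to test that guess is to rerun the original command with the augmentations kept but the batch sizes reduced. The values below (16/8/8) are only an illustrative starting point, not a recommendation from this thread; pick whatever fits your GPU memory and raise the sizes step by step once training runs:

%cd /content/DeepSpeech
!python3 DeepSpeech.py --train_cudnn True --early_stop True --es_epochs 6 --n_hidden 2048 --epochs 20 \
--export_dir /content/models/ --checkpoint_dir /content/model_checkpoints/ \
--train_files /content/train.csv --dev_files /content/dev.csv --test_files /content/test.csv \
--learning_rate 0.0001 --train_batch_size 16 --test_batch_size 8 --dev_batch_size 8 --export_file_name 'ft_model' \
--augment reverb[p=0.2,delay=50.0~30.0,decay=10.0:2.0~1.0] \
--augment volume[p=0.2,dbfs=-10:-40] \
--augment pitch[p=0.2,pitch=1~0.2] \
--augment tempo[p=0.2,factor=1~0.5]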