Unknown error/crash - TensorFlow LSTM with GPU (no output after start of 1st epoch)
I'm trying to train a model using an LSTM layer. I'm using a GPU, and all the required libraries load.
When I build the model this way:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
model.add(layers.LSTM(256, activation="relu", return_sequences=False))  # note the activation function
model.add(layers.Dropout(0.2))
model.add(layers.Dense(256, activation="relu"))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(1))
model.add(layers.Activation(activation="sigmoid"))
model.compile(
    loss=keras.losses.BinaryCrossentropy(),
    optimizer="adam",
    metrics=["accuracy"]
)
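it works. However, it uses activation="relu" on the LSTM layer, so it does not run as CuDNNLSTM, which, if I remember correctly, is selected automatically when the activation function is tanh (the default).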
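To make the selection rule concrete, here is a minimal sketch of the argument checks described in the Keras LSTM documentation for TF 2.x. The helper `cudnn_blockers` and the `CUDNN_REQUIREMENTS` table are hypothetical names invented for illustration, not part of TensorFlow. The documentation also lists conditions on masking and eager execution that this sketch does not cover.

```python
# Hypothetical helper (not a TensorFlow API): mirrors the layer-argument
# conditions under which TF 2.x Keras can use the fused cuDNN LSTM kernel.
CUDNN_REQUIREMENTS = {
    "activation": "tanh",
    "recurrent_activation": "sigmoid",
    "recurrent_dropout": 0,
    "unroll": False,
    "use_bias": True,
}


def cudnn_blockers(layer_kwargs):
    """Return the LSTM arguments that would force the generic (slower) kernel."""
    blockers = []
    for name, required in CUDNN_REQUIREMENTS.items():
        # An omitted argument falls back to the Keras default,
        # which is the cuDNN-compatible value.
        actual = layer_kwargs.get(name, required)
        if actual != required:
            blockers.append(f"{name}={actual!r} (cuDNN needs {required!r})")
    return blockers


# The layer from the model above, then the same layer without the activation override.
print(cudnn_blockers({"activation": "relu", "return_sequences": False}))
print(cudnn_blockers({"return_sequences": False}))
```

The first call reports the relu activation as the only blocker; the second returns an empty list, which matches the expectation that dropping the activation argument should switch the layer to the cuDNN path.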
That makes it too slow, so I would like to run the faster CuDNNLSTM. My code:
model = keras.Sequential()
model.add(layers.LSTM(256, return_sequences=False))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(256, activation="relu"))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(1))
model.add(layers.Activation(activation="sigmoid"))
model.compile(
    loss=keras.losses.BinaryCrossentropy(),
    optimizer="adam",
    metrics=["accuracy"]
)
It is basically the same, except that no activation function is given, so tanh is used.
But now, instead of training, the output ends like this:
2021-04-19 22:41:46.046218: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-04-19 22:41:46.046426: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-04-19 22:41:46.046642: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-04-19 22:41:46.046942: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-04-19 22:41:46.047124: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-04-19 22:41:46.047312: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-04-19 22:41:46.047489: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-04-19 22:41:46.047663: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-04-19 22:41:46.047936: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-04-19 22:41:46.665456: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-04-19 22:41:46.665712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-04-19 22:41:46.665876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-04-19 22:41:46.666186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2982 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2021-04-19 22:41:46.667505: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-19 22:42:07.374456: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
Epoch 1/50
2021-04-19 22:42:08.922891: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-04-19 22:42:09.272264: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-04-19 22:42:09.302667: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
Process finished with exit code -1073740791 (0xC0000409)
It just starts the first epoch, then freezes for about a minute and exits with this strange exit code.
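As a side note on the exit code: the negative decimal and the hex value in the log are the same 32-bit pattern, read as signed and unsigned respectively. A minimal sketch of the conversion follows (the function name `to_ntstatus` is made up for illustration). 0xC0000409 is the NTSTATUS value Windows documents as STATUS_STACK_BUFFER_OVERRUN, which would be consistent with a native-code abort rather than a Python exception. That is an interpretation, not something the log itself confirms.

```python
def to_ntstatus(exit_code):
    """Reinterpret a signed 32-bit process exit code as an unsigned NTSTATUS value."""
    return exit_code & 0xFFFFFFFF


status = to_ntstatus(-1073740791)
print(hex(status))  # same bit pattern as the value shown in parentheses in the log
```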
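- Shape of the input data: tf.Tensor([50985 29 7], shape=(3,), dtype=int32)
- My graphics card: Nvidia GTX 1050 Ti
- CUDA: v11.3
- OS: Windows 10
- IDE: PyCharm
Finding a solution is somewhat challenging because no error is printed. Am I doing something wrong? Has anyone run into a similar problem? What might help?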
// Edit: I have tried:
- running with fewer units (2 instead of 256) and a lower batch_size
- downgrading tensorflow to 2.4.0, CUDA to 11.0 and cudnn to 8.0.1, using Python 3.7.1 (this should be a correct combination according to this list from TensorFlow website)
- restarting my PC :)
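For checking an installed setup against the tested build combinations, something like the sketch below could help. The `TESTED` table holds only the two rows mentioned in this post, not the full list from the TensorFlow website, and `matches_tested_build` and `major_minor` are hypothetical helpers. The comparison is at major.minor granularity, which is an assumption about how the list should be read.

```python
# Tested combinations quoted in this post: TensorFlow -> (CUDA, cuDNN).
# Assumption: only major.minor matters when comparing against the list.
TESTED = {
    "2.4.0": ("11.0", "8.0"),
    "2.1.0": ("10.1", "7.6"),
}


def major_minor(version):
    """Reduce a version string such as 'v11.3' or '8.0.1' to 'major.minor'."""
    return ".".join(version.lstrip("v").split(".")[:2])


def matches_tested_build(tf_version, cuda, cudnn):
    """Return True/False for a known TensorFlow version, None if it is not in the table."""
    expected = TESTED.get(tf_version)
    if expected is None:
        return None
    return (major_minor(cuda), major_minor(cudnn)) == expected


print(matches_tested_build("2.4.0", "11.0", "8.0.1"))   # combination tried in the edit
print(matches_tested_build("2.4.0", "v11.3", "8.0.1"))  # CUDA version from the original setup
```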
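I found a solution... sort of.
When I downgraded tensorflow to 2.1.0, CUDA to 10.1 and cudnn to 7.6.5 (at the time, the 4th combination in this list on TensorFlow website), it works.
I don't know why it did not work with the latest version, or with the valid combination for tensorflow 2.4.0.
It works well, so my problem is solved. Still, it would be good to know why using LSTM with cudnn did not work for me on the newer versions, since I have not found this issue described anywhere.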