CuDNN crash in TF 2.x after many epochs of training

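I am getting increasingly desperate with my TensorFlow project. Installing TensorFlow took me hours, until I found out that PyCharm, Python 3.7 and TF 2.x are somehow incompatible. Now it runs, but after many epochs of training I get a very unspecific CuDNN error. Can you tell whether my code is wrong, or whether this looks like, for example, an installation problem? Can you point me in a direction? I also could not find anything specific to search for.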

My setup [in brackets: what I also tried]:

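This error occurs after about 3 hours of training. In other cases (or with other parameterizations of the network) it occurs earlier. Here is the complete output of the code snippet below: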

C:\Users\Fhnx\.virtualenvs\Processing-TA9ofq3q\Scripts\python.exe C:/Users/Fhnx/.../playground/AI_Predictor_Test.py
2020-05-08 11:47:25.924424: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
Starting training sweep with Epochs: 10000, LRstart: 0.01, LRend: 5e-05
2020-05-08 11:47:27.887135: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-05-08 11:47:27.912998: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.815GHz coreCount: 40 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2020-05-08 11:47:27.913212: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-05-08 11:47:27.921203: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-05-08 11:47:27.930115: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-05-08 11:47:27.932760: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-05-08 11:47:27.944938: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-05-08 11:47:27.952321: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-05-08 11:47:27.960042: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-08 11:47:27.960698: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-05-08 11:47:27.961058: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-05-08 11:47:27.969636: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2df4e1dcd00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-08 11:47:27.969831: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-05-08 11:47:27.970579: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.815GHz coreCount: 40 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2020-05-08 11:47:27.970964: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-05-08 11:47:27.971208: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-05-08 11:47:27.971389: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-05-08 11:47:27.971602: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-05-08 11:47:27.971839: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-05-08 11:47:27.972112: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-05-08 11:47:27.972324: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-08 11:47:27.973322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-05-08 11:47:28.530960: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-08 11:47:28.531109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0
2020-05-08 11:47:28.531180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N
2020-05-08 11:47:28.532337: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6213 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-05-08 11:47:28.534819: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2df7aeb31a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-05-08 11:47:28.534946: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2070 SUPER, Compute Capability 7.5
Model: "model"
Layer (type)                    Output Shape         Param #     Connected to
input_1 (InputLayer)            [(None, 22)]         0
tf_op_layer_ExpandDims (TensorF [(None, 22, 1)]      0           input_1[0][0]
dense (Dense)                   (None, 22, 64)       128         tf_op_layer_ExpandDims[0][0]
dense_3 (Dense)                 (None, 22, 64)       128         tf_op_layer_ExpandDims[0][0]
dense_6 (Dense)                 (None, 22, 64)       128         tf_op_layer_ExpandDims[0][0]
dense_9 (Dense)                 (None, 22, 64)       128         tf_op_layer_ExpandDims[0][0]
dense_12 (Dense)                (None, 22, 64)       128         tf_op_layer_ExpandDims[0][0]
dense_15 (Dense)                (None, 22, 64)       128         tf_op_layer_ExpandDims[0][0]
gaussian_dropout (GaussianDropo (None, 22, 64)       0           dense[0][0]
gaussian_dropout_2 (GaussianDro (None, 22, 64)       0           dense_3[0][0]
gaussian_dropout_4 (GaussianDro (None, 22, 64)       0           dense_6[0][0]
gaussian_dropout_6 (GaussianDro (None, 22, 64)       0           dense_9[0][0]
gaussian_dropout_8 (GaussianDro (None, 22, 64)       0           dense_12[0][0]
gaussian_dropout_10 (GaussianDr (None, 22, 64)       0           dense_15[0][0]
bidirectional (Bidirectional)   (None, 22, 16)       4672        gaussian_dropout[0][0]
bidirectional_2 (Bidirectional) (None, 22, 16)       4672        gaussian_dropout_2[0][0]
bidirectional_4 (Bidirectional) (None, 22, 16)       4672        gaussian_dropout_4[0][0]
bidirectional_6 (Bidirectional) (None, 22, 16)       4672        gaussian_dropout_6[0][0]
bidirectional_8 (Bidirectional) (None, 22, 16)       4672        gaussian_dropout_8[0][0]
bidirectional_10 (Bidirectional (None, 22, 16)       4672        gaussian_dropout_10[0][0]
bidirectional_1 (Bidirectional) (None, 22, 16)       1600        bidirectional[0][0]
bidirectional_3 (Bidirectional) (None, 22, 16)       1600        bidirectional_2[0][0]
bidirectional_5 (Bidirectional) (None, 22, 16)       1600        bidirectional_4[0][0]
bidirectional_7 (Bidirectional) (None, 22, 16)       1600        bidirectional_6[0][0]
bidirectional_9 (Bidirectional) (None, 22, 16)       1600        bidirectional_8[0][0]
bidirectional_11 (Bidirectional (None, 22, 16)       1600        bidirectional_10[0][0]
conv1d (Conv1D)                 (None, 20, 13)       1780        bidirectional_1[0][0]
conv1d_4 (Conv1D)               (None, 20, 13)       1780        bidirectional_3[0][0]
conv1d_8 (Conv1D)               (None, 20, 13)       1780        bidirectional_5[0][0]
conv1d_12 (Conv1D)              (None, 20, 13)       1780        bidirectional_7[0][0]
conv1d_16 (Conv1D)              (None, 20, 13)       1780        bidirectional_9[0][0]
conv1d_20 (Conv1D)              (None, 20, 13)       1780        bidirectional_11[0][0]
conv1d_1 (Conv1D)               (None, 20, 10)       1620        conv1d[0][0]
conv1d_5 (Conv1D)               (None, 20, 10)       1620        conv1d_4[0][0]
conv1d_9 (Conv1D)               (None, 20, 10)       1620        conv1d_8[0][0]
conv1d_13 (Conv1D)              (None, 20, 10)       1620        conv1d_12[0][0]
conv1d_17 (Conv1D)              (None, 20, 10)       1620        conv1d_16[0][0]
conv1d_21 (Conv1D)              (None, 20, 10)       1620        conv1d_20[0][0]
conv1d_2 (Conv1D)               (None, 20, 7)        1620        conv1d_1[0][0]
conv1d_6 (Conv1D)               (None, 20, 7)        1620        conv1d_5[0][0]
conv1d_10 (Conv1D)              (None, 20, 7)        1620        conv1d_9[0][0]
conv1d_14 (Conv1D)              (None, 20, 7)        1620        conv1d_13[0][0]
conv1d_18 (Conv1D)              (None, 20, 7)        1620        conv1d_17[0][0]
conv1d_22 (Conv1D)              (None, 20, 7)        1620        conv1d_21[0][0]
conv1d_3 (Conv1D)               (None, 20, 4)        1620        conv1d_2[0][0]
conv1d_7 (Conv1D)               (None, 20, 4)        1620        conv1d_6[0][0]
conv1d_11 (Conv1D)              (None, 20, 4)        1620        conv1d_10[0][0]
conv1d_15 (Conv1D)              (None, 20, 4)        1620        conv1d_14[0][0]
conv1d_19 (Conv1D)              (None, 20, 4)        1620        conv1d_18[0][0]
conv1d_23 (Conv1D)              (None, 20, 4)        1620        conv1d_22[0][0]
batch_normalization (BatchNorma (None, 20, 4)        16          conv1d_3[0][0]
batch_normalization_1 (BatchNor (None, 20, 4)        16          conv1d_7[0][0]
batch_normalization_2 (BatchNor (None, 20, 4)        16          conv1d_11[0][0]
batch_normalization_3 (BatchNor (None, 20, 4)        16          conv1d_15[0][0]
batch_normalization_4 (BatchNor (None, 20, 4)        16          conv1d_19[0][0]
batch_normalization_5 (BatchNor (None, 20, 4)        16          conv1d_23[0][0]
dense_1 (Dense)                 (None, 20, 128)      640         batch_normalization[0][0]
dense_4 (Dense)                 (None, 20, 128)      640         batch_normalization_1[0][0]
dense_7 (Dense)                 (None, 20, 128)      640         batch_normalization_2[0][0]
dense_10 (Dense)                (None, 20, 128)      640         batch_normalization_3[0][0]
dense_13 (Dense)                (None, 20, 128)      640         batch_normalization_4[0][0]
dense_16 (Dense)                (None, 20, 128)      640         batch_normalization_5[0][0]
gaussian_dropout_1 (GaussianDro (None, 20, 128)      0           dense_1[0][0]
gaussian_dropout_3 (GaussianDro (None, 20, 128)      0           dense_4[0][0]
gaussian_dropout_5 (GaussianDro (None, 20, 128)      0           dense_7[0][0]
gaussian_dropout_7 (GaussianDro (None, 20, 128)      0           dense_10[0][0]
gaussian_dropout_9 (GaussianDro (None, 20, 128)      0           dense_13[0][0]
gaussian_dropout_11 (GaussianDr (None, 20, 128)      0           dense_16[0][0]
flatten (Flatten)               (None, 2560)         0           gaussian_dropout_1[0][0]
flatten_1 (Flatten)             (None, 2560)         0           gaussian_dropout_3[0][0]
flatten_2 (Flatten)             (None, 2560)         0           gaussian_dropout_5[0][0]
flatten_3 (Flatten)             (None, 2560)         0           gaussian_dropout_7[0][0]
flatten_4 (Flatten)             (None, 2560)         0           gaussian_dropout_9[0][0]
flatten_5 (Flatten)             (None, 2560)         0           gaussian_dropout_11[0][0]
dense_2 (Dense)                 (None, 1)            2561        flatten[0][0]
dense_5 (Dense)                 (None, 1)            2561        flatten_1[0][0]
dense_8 (Dense)                 (None, 1)            2561        flatten_2[0][0]
dense_11 (Dense)                (None, 1)            2561        flatten_3[0][0]
dense_14 (Dense)                (None, 1)            2561        flatten_4[0][0]
dense_17 (Dense)                (None, 1)            2561        flatten_5[0][0]
concatenate (Concatenate)       (None, 6)            0           dense_2[0][0]
Total params: 97,542
Trainable params: 97,494
Non-trainable params: 48
***** Training Net ForkedConvLSTM_D64_LSTM2x8_Conv4x20x4_D1x128_dr0.40 now *****
BatchSize: 2108, NumNetParams: 97542, Feature shape: (500000, 22), Output shape: (500000, 6), In/Out Elem.: 14.0000M with est. size: 448.0000 MB
Epoch 1/10000
2020-05-08 11:47:57.675309: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-05-08 11:47:57.962354: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-08 11:47:59.216097: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: Invoking GPU asm compilation is supported on Cuda non-Windows platforms only
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
238/238 [==============================] - 21s 90ms/step - loss: 0.3145 - val_loss: 0.0846 - lr: 0.0100
Epoch 2/10000
238/238 [==============================] - 15s 62ms/step - loss: 0.0851 - val_loss: 0.0837 - lr: 0.0100
Epoch 694/10000
238/238 [==============================] - 14s 61ms/step - loss: 0.0833 - val_loss: 0.0836 - lr: 5.0000e-05
Epoch 695/10000
  6/238 [..............................] - ETA: 12s - loss: 0.08302020-05-08 14:39:02.141015: E tensorflow/stream_executor/dnn.cc:613] CUDNN_STATUS_INTERNAL_ERROR
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1986): 'cudnnRNNBackwardData( cudnn.handle(), rnn_desc.handle(), model_dims.max_seq_length, output_desc.handles(), output_data.opaque(), output_desc.handles(), output_backprop_data.opaque(), output_h_desc.handle(), output_h_backprop_data.opaque(), output_c_desc.handle(), output_c_backprop_data.opaque(), rnn_desc.params_handle(), params.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), input_desc.handles(), input_backprop_data->opaque(), input_h_desc.handle(), input_h_backprop_data->opaque(), input_c_desc.handle(), input_c_backprop_data->opaque(), workspace.opaque(), workspace.size(), reserve_space_data->opaque(), reserve_space_data->size())'
2020-05-08 14:39:02.141642: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at cudnn_rnn_ops.cc:1922 : Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 16, 8, 1, 22, 2108, 8]
2020-05-08 14:39:02.141037: F tensorflow/stream_executor/cuda/cuda_dnn.cc:189] Check failed: status == CUDNN_STATUS_SUCCESS (7 vs. 0)Failed to set cuDNN stream.
Process finished with exit code -1073740791 (0xC0000409)
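
The two most informative lines in the log above are the `OP_REQUIRES failed` line with the model config and the final exit code. As a reading aid, here is a minimal sketch that pairs the config keys with their values and shows the exit code as an unsigned 32-bit value. The helper names `parse_model_config` and `to_unsigned_hex` are made up for this illustration; the `RNN_MODES` table follows the `cudnnRNNMode_t` enum of cuDNN.

```python
import re

# The OP_REQUIRES message copied from the log above (timestamp stripped).
LOG_LINE = (
    "Failed to call ThenRnnBackward with model config: "
    "[rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , "
    "[num_layers, input_size, num_units, dir_count, max_seq_length, "
    "batch_size, cell_num_units]: [1, 16, 8, 1, 22, 2108, 8]"
)

# cudnnRNNMode_t values as defined by cuDNN.
RNN_MODES = {0: "RNN_RELU", 1: "RNN_TANH", 2: "LSTM", 3: "GRU"}


def parse_model_config(line):
    """Pair each bracketed key list with the numbers that follow it."""
    config = {}
    # "[a, b, c]:" followed by comma-separated integers, optionally bracketed.
    pattern = r"\[([a-z_, ]+)\]:\s*\[?([\d, ]+)\]?"
    for keys, values in re.findall(pattern, line):
        names = [k.strip() for k in keys.split(",")]
        numbers = [int(v) for v in values.replace(",", " ").split()]
        config.update(zip(names, numbers))
    return config


def to_unsigned_hex(exit_code):
    """Show a negative Windows exit code as an unsigned 32-bit hex value."""
    return "0x{:08X}".format(exit_code & 0xFFFFFFFF)


if __name__ == "__main__":
    cfg = parse_model_config(LOG_LINE)
    print(RNN_MODES[cfg["rnn_mode"]], cfg)
    print(to_unsigned_hex(-1073740791))
```

Read this way, the failing kernel is an LSTM backward pass with 8 units, an input size of 16, a sequence length of 22 and a batch size of 2108. That matches the second `Bidirectional` LSTM layer of each branch in the model summary and the `BatchSize: 2108` line, so the crash happens inside the network's normal training step and not in some unrelated op. The exit code is the same number that PyCharm prints in hex, 0xC0000409.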

Here is some code that should run and produce the output above:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# from os import environ
# environ['TF_CPP_MIN_LOG_LEVEL'] = '1'

from tensorflow.keras.models import *
from tensorflow.keras.layers import *
from tensorflow.keras.optimizers import *
import tensorflow as tf
import numpy as np
import sys

def build_model_simple(inputLength=1, outputLength=1, lr=0.0001, device="/gpu:0",
                       nNeuFirstDense=64, dropoutRate=0.4,
                       numLSTM=2, nNeuLSTM=8,
                       numConv=4, nFiltConv=20, szConvKernel=4,
                       numDenseInner=1, nNeuDenseInner=128):
    # nNeuFirstDense and dropoutRate were missing from the posted signature; the
    # defaults 64 and 0.4 are taken from the net name in the log (D64, dr0.40).
    with tf.device(device):
        inputs = Input(shape=(inputLength,), dtype=tf.float32)
        inputExp = tf.expand_dims(inputs, -1)
        allInner = []
        # Build one independent branch per output value.
        for _ in range(outputLength):
            inner = Dense(nNeuFirstDense, activation="linear")(inputExp)
            inner = GaussianDropout(rate=dropoutRate)(inner)

            if numLSTM and nNeuLSTM:
                for _ in range(numLSTM):
                    inner = Bidirectional(LSTM(nNeuLSTM, return_sequences=True))(inner)

            if numConv:
                for _ in range(numConv):
                    # The end of this call was cut off in the post. channels_first is
                    # inferred from the Conv1D shapes and parameter counts in the model
                    # summary; the activation is unknown and left at the Keras default.
                    inner = Conv1D(filters=nFiltConv, kernel_size=szConvKernel,
                                   strides=1, padding='valid',
                                   data_format='channels_first')(inner)
                inner = BatchNormalization()(inner)

            if numDenseInner:
                for _ in range(numDenseInner):
                    inner = Dense(nNeuDenseInner, activation="linear")(inner)
                    inner = GaussianDropout(rate=dropoutRate)(inner)
            inner = Flatten()(inner)
            inner = Dense(1, activation="linear")(inner)
            # Missing in the post: collect the branch output for Concatenate.
            allInner.append(inner)
        out = Concatenate()(allInner)
        # out = outTmp * outTmp * outTmp
        model = Model(inputs=inputs, outputs=out)

        model.compile(loss="mse", optimizer=Adam(learning_rate=lr))
        # model.compile(loss="mse", optimizer=Adadelta())
        return model, 'ForkedConvLSTM_D{}_LSTM{}x{}_Conv{}x{}x{}_D{}x{}_dr{:.2f}'.format(
            nNeuFirstDense,
            numLSTM, nNeuLSTM,
            numConv, nFiltConv, szConvKernel,
            numDenseInner, nNeuDenseInner,
            dropoutRate)

def scheduler(epoch, lrStart, lrEnd, lrDecay=0.05, lrNStable=10):
    lr = lrStart
    if epoch > lrNStable:
        fac = tf.math.exp(lrDecay * (lrNStable - epoch))
        lr = lrStart * fac + lrEnd * (1 - fac)
    return lr

if __name__ == '__main__':
    numFeatures = 22
    numOutputs = 6

    trainIn = np.random.rand(500000, numFeatures)
    trainOut = np.random.rand(500000, numOutputs)
    valiIn = np.random.rand(12000, numFeatures)
    valiOut = np.random.rand(12000, numOutputs)

    numDataElements = trainIn.shape[0] * (trainIn.shape[1] + trainOut.shape[1])
    sizeCalc = numDataElements * sys.getsizeof(trainIn[0][0])

    EPOCHS = 10000
    # Missing in the post; the value is taken from the "LRstart: 0.01" log line.
    LEARNING_RATE_START = 0.01
    LEARNING_RATE_END = 0.00005

    print("Starting training sweep with Epochs: {}, LRstart: {}, LRend: {}".format(
        EPOCHS, LEARNING_RATE_START, LEARNING_RATE_END))

    network, nwName = build_model_simple(inputLength=numFeatures, outputLength=numOutputs)

    netWeights = network.get_weights()
    numNetParams = np.sum([np.prod(ele.shape) for ele in netWeights])

    # Estimation of batch size: GPU RAM * RAM factor / number of net parameters = ~75k.
    # This is divided by 35 to get a good rough estimate for the batch size.
    BATCH_SIZE = int(np.floor(8 * 1e9 * 0.9 / numNetParams / 35))

    print("***** Training Net {} now *****".format(nwName))
    print("BatchSize: {}, NumNetParams: {}, Feature shape: {}, Output shape: "
          "{}, In/Out Elem.: {:.4f}M with est. size: {:.4f} MB".format(
        BATCH_SIZE, numNetParams, trainIn.shape, trainOut.shape,
        numDataElements / 1e6, sizeCalc / 1e6))

    # The arguments of this call were cut off in the post; reconstructed from the
    # scheduler() signature and the learning rates shown in the log.
    callback = tf.keras.callbacks.LearningRateScheduler(
        lambda epoch: scheduler(epoch, LEARNING_RATE_START, LEARNING_RATE_END))
    fitRes = network.fit(trainIn, trainOut, batch_size=BATCH_SIZE, epochs=EPOCHS,
                         validation_data=(valiIn, valiOut),
                         callbacks=[callback, tf.keras.callbacks.TerminateOnNaN()])


I played around with many different versions. I even tried to get CUDA 10.2 working by symlinking the new DLLs to the old names. But even that did not fix the error.

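I finally got it working by removing everything from NVIDIA (including the driver) and installing the latest 10.1 release (from late 2019) together with the Studio driver that ships with that release. So driver version 431.86, not the latest Studio driver 441.66.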
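
When swapping drivers and CUDA versions like this, it can help to record what each attempt actually ended up with. The sketch below is one possible way to do that, not a definitive tool: `collect_setup` is a hypothetical helper, it assumes `nvidia-smi` is on the PATH when a driver is installed, and `tf.sysconfig.get_build_info()` is only present in newer TF 2.x releases, so that call is guarded.

```python
import platform
import shutil
import subprocess


def collect_setup():
    """Return a dict describing the Python / TensorFlow / NVIDIA driver setup."""
    info = {"python": platform.python_version(), "os": platform.platform()}

    try:
        import tensorflow as tf
    except ImportError:
        info["tensorflow"] = None
    else:
        info["tensorflow"] = tf.__version__
        info["built_with_cuda"] = tf.test.is_built_with_cuda()
        info["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
        # get_build_info() does not exist in older TF 2.x releases.
        get_build_info = getattr(tf.sysconfig, "get_build_info", None)
        if get_build_info is not None:
            build = get_build_info()
            info["cuda_build"] = build.get("cuda_version")
            info["cudnn_build"] = build.get("cudnn_version")

    # nvidia-smi is installed with the NVIDIA driver; skip it if not found.
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            stdout=subprocess.PIPE, universal_newlines=True)
        info["driver"] = result.stdout.strip() or None
    else:
        info["driver"] = None
    return info


if __name__ == "__main__":
    for key, value in collect_setup().items():
        print("{}: {}".format(key, value))
```

Running this before and after a reinstall gives a short list of versions that can be pasted into a question or compared against the tested build configurations in the TensorFlow install guide.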
