Why is my GPU getting interrupted when training my data?

I spent hours configuring my computer so that Python would finally train my data on the GPU instead of the CPU. However, for some reason my models keep getting interrupted in the middle of their epochs, and I can never finish training a model.

Waiting does not fix the problem, and I cannot interrupt the kernel either. I have tried other people's solutions, but still without much luck.

If I use the CPU (at a crawl), I can train my models normally, but when I switch to the GPU they train very quickly before hanging partway through, never completing all of the required epochs. After that my Python kernel is also stuck on "running", and I cannot interrupt it unless I kill the whole thing from Task Manager.

According to the performance history in Task Manager, my GPU shows sustained spikes during training, which is expected. However, when it hangs, my GPU activity drops back to 0 even though the kernel indicates training is still in the middle of an epoch. This happens at random and does not depend on elapsed time or the number of epochs, although the longer I train, the more likely it is to happen.

Here is my sequential model.

# Imports assumed by the code below (the original post does not show them)
import numpy
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Activation
from tensorflow.keras.layers import BatchNormalization as BatchNorm
from tensorflow.keras.callbacks import ModelCheckpoint
from keras.utils import np_utils  # provides to_categorical

def prepare_sequences(notes, n_vocab, seq_len):
    """ Prepare the sequences used by the Neural Network """
    sequence_length = seq_len

    names = sorted(set(item for item in notes))
    note_to_int = dict((note, number) for number, note in enumerate(names))

    network_input = []
    network_output = []

    # create input sequences and the corresponding outputs
    for i in range(0, len(notes) - sequence_length, 1):
        sequence_in = notes[i:i + sequence_length]
        sequence_out = notes[i + sequence_length]
        network_input.append([note_to_int[char] for char in sequence_in])
        network_output.append(note_to_int[sequence_out])

    n_patterns = len(network_input)

    # reshape the input into a format compatible with LSTM layers
    network_input = numpy.reshape(network_input, (n_patterns, sequence_length, 1))
    # normalize input
    network_input = network_input / float(n_vocab)

    network_output = np_utils.to_categorical(network_output)

    return (network_input, network_output)

def create_network(network_input, n_vocab, LSTM_node_count, Dropout_count):
    """ create the structure of the neural network """
    model = Sequential()
    model.add(LSTM(
        LSTM_node_count,
        input_shape=(network_input.shape[1], network_input.shape[2]),
        recurrent_dropout=Dropout_count,
        return_sequences=True
    ))
    model.add(LSTM(
        LSTM_node_count,
        return_sequences=True,
        recurrent_dropout=Dropout_count
    ))
    model.add(LSTM(LSTM_node_count))
    model.add(BatchNorm())
    model.add(Dropout(Dropout_count))
    model.add(Dense(256))
    model.add(Activation('relu'))
    model.add(BatchNorm())
    model.add(Dropout(Dropout_count))
    model.add(Dense(n_vocab))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

    return model

def train(model, network_input, network_output, epoch, batchsize):
    """ train the neural network """
    filepath = "trained_weights/" + "weights-improvement-{epoch:02d}-{loss:.4f}-bigger.hdf5"
    checkpoint = ModelCheckpoint(
        filepath,
        monitor='loss',
        verbose=0,
        save_best_only=True,
        mode='min'
    )
    callbacks_list = [checkpoint]

    model.fit(network_input,
              network_output,
              epochs=epoch,
              batch_size=batchsize,
              callbacks=callbacks_list)

# TF1-style session configuration applied before training
configproto = tf.compat.v1.ConfigProto()
configproto.gpu_options.allow_growth = True
configproto.gpu_options.polling_inactive_delay_msecs = 10
sess = tf.compat.v1.Session(config=configproto)
tf.compat.v1.keras.backend.set_session(sess)

During training I also receive a warning message, and I do not know what it means.

WARNING:tensorflow:Layer lstm will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
WARNING:tensorflow:Layer lstm_1 will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
C:\Users\David>nvidia-smi
Sun Dec 27 15:56:16 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.89       Driver Version: 460.89       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1050   WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P8    N/A /  N/A |    120MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5496    C+G   ...5n1h2txyewy\SearchApp.exe    N/A      |
|    0   N/A  N/A      7372    C+G   ...nputApp\TextInputHost.exe    N/A      |
|    0   N/A  N/A      8268    C+G   ...wekyb3d8bbwe\Music.UI.exe    N/A      |
|    0   N/A  N/A      9420    C+G   ...artMenuExperienceHost.exe    N/A      |
|    0   N/A  N/A     10084    C+G   ...ekyb3d8bbwe\YourPhone.exe    N/A      |
|    0   N/A  N/A     11292    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A     14684    C+G   ...cw5n1h2txyewy\LockApp.exe    N/A      |
+-----------------------------------------------------------------------------+

I am currently using TensorFlow 2.4 and CUDA 11.2.

You are using recurrent_dropout > 0, which does not meet the LSTM layer's compatibility requirements for the cuDNN-optimized kernel; that is exactly what the warnings above are telling you. Setting recurrent_dropout = 0 solves the problem (see the sketch below).
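
For reference, here is a minimal sketch of create_network with the recurrent_dropout arguments removed (they default to 0), so that the stacked LSTM layers meet the cuDNN kernel criteria; everything else is kept as in the question:

def create_network(network_input, n_vocab, LSTM_node_count, Dropout_count):
    """ create the structure of the neural network (cuDNN-compatible LSTM layers) """
    model = Sequential()
    model.add(LSTM(
        LSTM_node_count,
        input_shape=(network_input.shape[1], network_input.shape[2]),
        return_sequences=True          # recurrent_dropout left at its default of 0
    ))
    model.add(LSTM(LSTM_node_count, return_sequences=True))
    model.add(LSTM(LSTM_node_count))
    model.add(BatchNorm())
    model.add(Dropout(Dropout_count))  # ordinary Dropout layers do not affect the cuDNN criteria
    model.add(Dense(256))
    model.add(Activation('relu'))
    model.add(BatchNorm())
    model.add(Dropout(Dropout_count))
    model.add(Dense(n_vocab))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

    return model

With this change, TensorFlow should stop emitting the "will not use cuDNN kernel" warnings and dispatch the LSTM layers to the fused cuDNN implementation; if you still need regularization, the plain Dropout layers between the other layers can stay as they are.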