Why is my GPU getting interrupted when training my data?

I spent hours configuring my computer so that Python would finally train my data on the GPU instead of the CPU. However, for some reason my models keep getting interrupted in the middle of their epochs, and I can never finish training a model.

Waiting does not fix the problem, and I cannot interrupt the kernel either. I have tried other people's solutions, but still without much luck.

If I use the CPU (at a crawl), I can train my models normally, but when I switch to the GPU they train very quickly before hanging partway through, never completing all of the required epochs. After that my Python kernel is also stuck on "running", and I cannot interrupt it unless I kill the whole thing from Task Manager.

According to the performance history in Task Manager, my GPU shows sustained spikes during training, which is expected. However, when it hangs, my GPU activity drops back to 0 even though the kernel indicates training is still in the middle of an epoch. This happens at random and does not depend on elapsed time or the number of epochs, although the longer I train, the more likely it is to happen.

Here is my sequential model.

# Imports assumed by the code below (the original post does not show them)
import numpy
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Activation
from tensorflow.keras.layers import BatchNormalization as BatchNorm
from tensorflow.keras.callbacks import ModelCheckpoint
from keras.utils import np_utils  # provides to_categorical

def prepare_sequences(notes, n_vocab, seq_len):
    """ Prepare the sequences used by the Neural Network """
    sequence_length = seq_len

    names = sorted(set(item for item in notes))
    note_to_int = dict((note, number) for number, note in enumerate(names))

    network_input = []
    network_output = []

    # create input sequences and the corresponding outputs
    for i in range(0, len(notes) - sequence_length, 1):
        sequence_in = notes[i:i + sequence_length]
        sequence_out = notes[i + sequence_length]
        network_input.append([note_to_int[char] for char in sequence_in])
        network_output.append(note_to_int[sequence_out])

    n_patterns = len(network_input)

    # reshape the input into a format compatible with LSTM layers
    network_input = numpy.reshape(network_input, (n_patterns, sequence_length, 1))
    # normalize input
    network_input = network_input / float(n_vocab)

    network_output = np_utils.to_categorical(network_output)

    return (network_input, network_output)

def create_network(network_input, n_vocab, LSTM_node_count, Dropout_count):
    """ create the structure of the neural network """
    model = Sequential()
    model.add(LSTM(
        LSTM_node_count,
        input_shape=(network_input.shape[1], network_input.shape[2]),
        recurrent_dropout=Dropout_count,
        return_sequences=True
    ))
    model.add(LSTM(
        LSTM_node_count,
        return_sequences=True,
        recurrent_dropout=Dropout_count
    ))
    model.add(LSTM(LSTM_node_count))
    model.add(BatchNorm())
    model.add(Dropout(Dropout_count))
    model.add(Dense(256))
    model.add(Activation('relu'))
    model.add(BatchNorm())
    model.add(Dropout(Dropout_count))
    model.add(Dense(n_vocab))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

    return model

def train(model, network_input, network_output, epoch, batchsize):
    """ train the neural network """
    filepath = "trained_weights/" + "weights-improvement-{epoch:02d}-{loss:.4f}-bigger.hdf5"
    checkpoint = ModelCheckpoint(
        filepath,
        monitor='loss',
        verbose=0,
        save_best_only=True,
        mode='min'
    )
    callbacks_list = [checkpoint]

    model.fit(network_input,
              network_output,
              epochs=epoch,
              batch_size=batchsize,
              callbacks=callbacks_list)

# TF1-style session configuration applied before training
configproto = tf.compat.v1.ConfigProto()
configproto.gpu_options.allow_growth = True
configproto.gpu_options.polling_inactive_delay_msecs = 10
sess = tf.compat.v1.Session(config=configproto)
tf.compat.v1.keras.backend.set_session(sess)

During training I also receive a warning message, and I do not know what it means.

WARNING:tensorflow:Layer lstm will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
WARNING:tensorflow:Layer lstm_1 will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
C:\Users\David>nvidia-smi
Sun Dec 27 15:56:16 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.89       Driver Version: 460.89       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1050   WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P8    N/A /  N/A |    120MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5496    C+G   ...5n1h2txyewy\SearchApp.exe    N/A      |
|    0   N/A  N/A      7372    C+G   ...nputApp\TextInputHost.exe    N/A      |
|    0   N/A  N/A      8268    C+G   ...wekyb3d8bbwe\Music.UI.exe    N/A      |
|    0   N/A  N/A      9420    C+G   ...artMenuExperienceHost.exe    N/A      |
|    0   N/A  N/A     10084    C+G   ...ekyb3d8bbwe\YourPhone.exe    N/A      |
|    0   N/A  N/A     11292    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A     14684    C+G   ...cw5n1h2txyewy\LockApp.exe    N/A      |
+-----------------------------------------------------------------------------+

I am currently using TensorFlow 2.4 and CUDA 11.2.

You are using recurrent_dropout > 0, which does not meet the LSTM layer's compatibility requirements for the cuDNN-optimized kernel; that is exactly what the warnings above are telling you. Setting recurrent_dropout = 0 solves the problem (see the sketch below).
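
For reference, here is a minimal sketch of create_network with the recurrent_dropout arguments removed (they default to 0), so that the stacked LSTM layers meet the cuDNN kernel criteria; everything else is kept as in the question:

def create_network(network_input, n_vocab, LSTM_node_count, Dropout_count):
    """ create the structure of the neural network (cuDNN-compatible LSTM layers) """
    model = Sequential()
    model.add(LSTM(
        LSTM_node_count,
        input_shape=(network_input.shape[1], network_input.shape[2]),
        return_sequences=True          # recurrent_dropout left at its default of 0
    ))
    model.add(LSTM(LSTM_node_count, return_sequences=True))
    model.add(LSTM(LSTM_node_count))
    model.add(BatchNorm())
    model.add(Dropout(Dropout_count))  # ordinary Dropout layers do not affect the cuDNN criteria
    model.add(Dense(256))
    model.add(Activation('relu'))
    model.add(BatchNorm())
    model.add(Dropout(Dropout_count))
    model.add(Dense(n_vocab))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

    return model

With this change, TensorFlow should stop emitting the "will not use cuDNN kernel" warnings and dispatch the LSTM layers to the fused cuDNN implementation; if you still need regularization, the plain Dropout layers between the other layers can stay as they are.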