Why is my GPU getting interrupted when training my data?
I spent several hours configuring my machine so that Python finally trains on the GPU instead of the CPU. However, for some reason my models keep getting interrupted in the middle of an epoch, and I can never finish training them.
Waiting does not fix it, and I cannot interrupt the kernel either. I have tried other people's solutions but still have not had much luck.
If I use the CPU (at a crawl), I can train my models just fine, but when I switch to the GPU they train very quickly until they hang partway through, without completing all of the requested epochs. After that my Python kernel is stuck on "running", and I cannot interrupt it unless I kill the whole thing from Task Manager.
According to my Task Manager performance history, my GPU shows constant activity spikes during training, which is expected. But when it hangs, GPU activity drops back to 0 even though the kernel says training is still in the middle of an epoch. This happens at random and does not depend on elapsed time or epoch count, although the longer I train, the more likely it is to happen.
Here is my sequential model.
# Imports are not shown in the original script; the following is an assumed,
# TF 2.4-compatible set (BatchNorm is taken to be an alias for BatchNormalization).
import numpy
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Activation
from tensorflow.keras.layers import BatchNormalization as BatchNorm
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical

def prepare_sequences(notes, n_vocab, seq_len):
    """ Prepare the sequences used by the Neural Network """
    sequence_length = seq_len
    names = sorted(set(item for item in notes))
    note_to_int = dict((note, number) for number, note in enumerate(names))
    network_input = []
    network_output = []
    # create input sequences and the corresponding outputs
    for i in range(0, len(notes) - sequence_length, 1):
        sequence_in = notes[i:i + sequence_length]
        sequence_out = notes[i + sequence_length]
        network_input.append([note_to_int[char] for char in sequence_in])
        network_output.append(note_to_int[sequence_out])
    n_patterns = len(network_input)
    # reshape the input into a format compatible with LSTM layers
    network_input = numpy.reshape(network_input, (n_patterns, sequence_length, 1))
    # normalize input
    network_input = network_input / float(n_vocab)
    # one-hot encode the outputs (np_utils.to_categorical in the original)
    network_output = to_categorical(network_output)
    return (network_input, network_output)

def create_network(network_input, n_vocab, LSTM_node_count, Dropout_count):
    """ create the structure of the neural network """
    model = Sequential()
    model.add(LSTM(
        LSTM_node_count,
        input_shape=(network_input.shape[1], network_input.shape[2]),
        recurrent_dropout=Dropout_count,
        return_sequences=True
    ))
    model.add(LSTM(
        LSTM_node_count,
        return_sequences=True,
        recurrent_dropout=Dropout_count,
    ))
    model.add(LSTM(LSTM_node_count))
    model.add(BatchNorm())
    model.add(Dropout(Dropout_count))
    model.add(Dense(256))
    model.add(Activation('relu'))
    model.add(BatchNorm())
    model.add(Dropout(Dropout_count))
    model.add(Dense(n_vocab))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
    return model

def train(model, network_input, network_output, epoch, batchsize):
    """ train the neural network """
    filepath = "trained_weights/" + "weights-improvement-{epoch:02d}-{loss:.4f}-bigger.hdf5"
    checkpoint = ModelCheckpoint(
        filepath,
        monitor='loss',
        verbose=0,
        save_best_only=True,
        mode='min'
    )
    callbacks_list = [checkpoint]
    model.fit(network_input,
              network_output,
              epochs=epoch,
              batch_size=batchsize,
              callbacks=callbacks_list)

# TF1-style session config: grow GPU memory on demand instead of grabbing it all
configproto = tf.compat.v1.ConfigProto()
configproto.gpu_options.allow_growth = True
configproto.gpu_options.polling_inactive_delay_msecs = 10
sess = tf.compat.v1.Session(config=configproto)
tf.compat.v1.keras.backend.set_session(sess)
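For reference, I understand the same allow_growth behaviour can also be requested through the TF2-native config API (a minimal sketch, not what I actually ran):

import tensorflow as tf

# TF2-native equivalent of gpu_options.allow_growth = True
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)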
During training I also get warning messages whose meaning I don't understand.
WARNING:tensorflow:Layer lstm will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
WARNING:tensorflow:Layer lstm_1 will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
C:\Users\David>nvidia-smi
Sun Dec 27 15:56:16 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.89 Driver Version: 460.89 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 1050 WDDM | 00000000:01:00.0 Off | N/A |
| N/A 47C P8 N/A / N/A | 120MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 5496 C+G ...5n1h2txyewy\SearchApp.exe N/A |
| 0 N/A N/A 7372 C+G ...nputApp\TextInputHost.exe N/A |
| 0 N/A N/A 8268 C+G ...wekyb3d8bbwe\Music.UI.exe N/A |
| 0 N/A N/A 9420 C+G ...artMenuExperienceHost.exe N/A |
| 0 N/A N/A 10084 C+G ...ekyb3d8bbwe\YourPhone.exe N/A |
| 0 N/A N/A 11292 C+G Insufficient Permissions N/A |
| 0 N/A N/A 14684 C+G ...cw5n1h2txyewy\LockApp.exe N/A |
+-----------------------------------------------------------------------------+
I am currently using TensorFlow 2.4 and CUDA 11.2.
You are using recurrent_dropout > 0, which does not meet the LSTM layer's requirements for the cuDNN-optimized kernel (that is what the warning is telling you). Setting recurrent_dropout = 0 solves the problem.
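As a minimal sketch (not your exact code), a cuDNN-eligible version of the network keeps the TF 2.x defaults (activation='tanh', recurrent_activation='sigmoid', use_bias=True, unroll=False) and simply drops recurrent_dropout; the function and argument names below are illustrative:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def create_cudnn_friendly_network(network_input, n_vocab, LSTM_node_count):
    model = Sequential()
    model.add(LSTM(
        LSTM_node_count,
        input_shape=(network_input.shape[1], network_input.shape[2]),
        return_sequences=True   # no recurrent_dropout -> cuDNN kernel can be used
    ))
    model.add(LSTM(LSTM_node_count))
    model.add(Dense(n_vocab, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
    return model

Note that ordinary dropout (separate Dropout layers, or the dropout= argument on the LSTM) is not part of the cuDNN exclusion criteria, so you do not have to give up regularization entirely; only recurrent_dropout blocks the optimized kernel.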