Keras --- Training freezes during fit_generator()
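I am trying to train on a dataset of 6000 training samples and 1000 validation samples, but I have run into a problem: the program freezes and hangs during training, without any error message.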
1970/6000 [========>.....................] - ETA: 1:50:11 - loss: 1.2256 - accuracy: 0.5956
1971/6000 [========>.....................] - ETA: 1:50:08 - loss: 1.2252 - accuracy: 0.5958
1972/6000 [========>.....................] - ETA: 1:50:08 - loss: 1.2248 - accuracy: 0.5960
1973/6000 [========>.....................] - ETA: 1:50:06 - loss: 1.2245 - accuracy: 0.5962
1974/6000 [========>.....................] - ETA: 1:50:04 - loss: 1.2241 - accuracy: 0.5964
1975/6000 [========>.....................] - ETA: 1:50:02 - loss: 1.2243 - accuracy: 0.5961
1976/6000 [========>.....................] - ETA: 1:50:00 - loss: 1.2239 - accuracy: 0.5963
1977/6000 [========>.....................] - ETA: 1:49:58 - loss: 1.2236 - accuracy: 0.5965
1978/6000 [========>.....................] - ETA: 1:49:57 - loss: 1.2241 - accuracy: 0.5962
1979/6000 [========>.....................] - ETA: 1:49:56 - loss: 1.2237 - accuracy: 0.5964
1980/6000 [========>.....................] - ETA: 1:49:55 - loss: 1.2242 - accuracy: 0.5961
1981/6000 [========>.....................] - ETA: 1:49:53 - loss: 1.2252 - accuracy: 0.5958
1982/6000 [========>.....................] - ETA: 1:49:52 - loss: 1.2257 - accuracy: 0.5955
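I waited 5-6 minutes, but nothing seemed to happen.
What I have tried:
- Changing steps_per_epoch to 100 and increasing the number of epochs to 20
- I thought the ReduceLROnPlateau callback was the problem, so I added cooldown=1
Neither of these two attempts solved the problem.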
Hardware configuration:
- i5-8300H
- GTX 1060 6GB
Dependencies:
- Keras 2.3.1
- TensorFlow 2.0.0 (GPU version)
The code is as follows:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import keras
import tensorflow as tf
from skimage import exposure, color
from keras.optimizers import Adam
from tqdm import tqdm
from keras.models import Model
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D,Convolution2D
from keras.layers import Activation, Dropout, Flatten, Dense
from keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint, Callback
from keras import regularizers
from keras.applications.densenet import DenseNet121
from keras_preprocessing.image import ImageDataGenerator
from sklearn.utils import class_weight
from collections import Counter
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth=True
session = tf.compat.v1.Session(config=config)
# Histogram equalization
def HE(img):
    img_eq = exposure.equalize_hist(img)
    return img_eq

def plotImages(images_arr):
    fig, axes = plt.subplots(1, 5, figsize=(20,20))
    axes = axes.flatten()
    for img, ax in zip( images_arr, axes):
        ax.imshow(img)
        ax.axis('off')
    plt.tight_layout()
    plt.show()
train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    rotation_range=40,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest',
    preprocessing_function=HE,
)
validation_datagen = ImageDataGenerator(
    rescale=1./255
)
test_datagen = ImageDataGenerator(
    rescale=1./255
)
#get image and label with augmentation
train = train_datagen.flow_from_directory(
    'train/train_deep/',
    target_size=(224,224),
    class_mode='categorical',
    shuffle=False,
    batch_size = 20,
)
test = test_datagen.flow_from_directory(
    'test_deep/',
    batch_size=1,
    target_size = (224,224),
)
val = validation_datagen.flow_from_directory(
    'train/validate_deep/',
    target_size=(224,224),
    batch_size = 20,
)
#Training
X_train, y_train = next(train)
class_names = ['No DR', 'Mild', 'Moderate', 'Severe', 'Proliferative DR']
counter = Counter(train.classes)
class_weights = class_weight.compute_class_weight(
    'balanced',
    np.unique(train.classes),
    train.classes)
#X_test , y_test = next(test)
#X_test=np.reshape(X_test,(X_test.shape[0],X_test.shape[1],X_test.shape[2]))
#Training parameter
batch_size =32
Epoch = 2
model = DenseNet121(include_top=True, weights=None, input_tensor=None, input_shape=(224,224,3), pooling=None, classes=5)
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(learning_rate=0.01),
              metrics=['accuracy'])
model.summary()
filepath="weights-improvement-{epoch:02d}-{val_loss:.2f}.hdf5"
checkpointer = ModelCheckpoint(filepath,monitor='val_loss', verbose=1, save_best_only=True,save_weights_only=True)
lr_reduction = ReduceLROnPlateau(monitor='val_loss', patience=5, verbose=2, factor=0.2,cooldown=1)
callbacks_list = [checkpointer, lr_reduction]
#Validation
X_val , y_val = next(val)
#history = model.fit(X_train,y_train,epochs=Epoch,validation_data = (X_val,y_val))
history = model.fit_generator(
    train,
    epochs=Epoch,
    steps_per_epoch=6000,
    class_weight=class_weights,
    validation_data=val,
    validation_steps=1000,
    use_multiprocessing = False,
    max_queue_size=100,
    workers = 1,
    callbacks=callbacks_list
)
# Score trained model.
scores = model.evaluate(X_val, y_val, verbose=1)
print('Test loss:', scores[0])
print('Test accuracy:', scores[1])
#predict
test.reset()
pred=model.predict_generator(test,
                             steps=25,)
print(pred)
for i in pred:
    print(np.argmax(i))
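If you were using Keras < 2.0.0 (I do not recommend using such an old version), this code would run fine.
Your error occurs because you are using Keras > 2.0.0, or the Keras that ships inside TensorFlow.
The exact error in your code comes from these lines: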
history = model.fit_generator(  # change `.fit_generator()` to `.fit()`
    train,
    epochs=Epoch,
    steps_per_epoch=6000,  # change this to 6000//32
    class_weight=class_weights,
    validation_data=val,
    validation_steps=1000,  # change this to 1000//32
    use_multiprocessing = False,
    max_queue_size=100,
    workers = 1,
    callbacks=callbacks_list
)
The parameters "steps_per_epoch" and "validation_steps" must be equal to the length of the dataset (the number of samples) divided by the batch size.
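As a rough illustration of that rule, here is a minimal sketch of the arithmetic. The helper name `steps_for` is hypothetical, and it assumes that 6000 and 1000 are sample counts. Note that the comments above divide by 32 (the unused `batch_size` variable), while the `flow_from_directory` calls in the question pass `batch_size = 20`. The divisor that matters is presumably the batch size the generator actually yields.

```python
import math


def steps_for(num_samples, batch_size, drop_partial=False):
    """Number of generator batches that make up one pass over the data."""
    if num_samples <= 0 or batch_size <= 0:
        raise ValueError("num_samples and batch_size must be positive")
    if drop_partial:
        # Floor division skips the final, smaller batch.
        return num_samples // batch_size
    # Ceiling division counts the final partial batch as one more step.
    return math.ceil(num_samples / batch_size)


# Assumed counts: 6000 training and 1000 validation images, with the
# batch size of 20 that the question passes to flow_from_directory.
train_steps = steps_for(6000, 20)
val_steps = steps_for(1000, 20)
print(train_steps, val_steps)  # 300 50

# With the real iterators, the inputs would likely come from their
# attributes, e.g. steps_for(train.samples, train.batch_size).
```

These values would then be passed as `steps_per_epoch=train_steps` and `validation_steps=val_steps`. Whether to floor or round up is a design choice: flooring (as in the `//` comments above) skips the last partial batch, while rounding up covers every sample once per epoch.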