Why is the evaluation of a model saved with ModelCheckpoint different from the results in the training history?

My code is as follows:

from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout
from keras.utils import np_utils
from keras.callbacks import ModelCheckpoint
import numpy as np

best_weights_filepath = './best_weights.hdf5'

labels = np.array([1, 2]) # 0 - num_classes - 1
y_train = np_utils.to_categorical(labels, 3)
X_train = np.array([[[1, 2], [3, 4]], [[1, 2], [3, 4]]])

model = Sequential()
model.add(Flatten(input_shape=X_train.shape[1:]))
model.add(Dropout(0.2))
model.add(Dense(64))
model.add(Dropout(0.15))
model.add(Dense(32))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd')
mcp = ModelCheckpoint(best_weights_filepath, monitor="loss",
                      save_best_only=True)
hist = model.fit(X_train, y_train, 32, 50, callbacks=[mcp])  # batch_size=32, nb_epoch=50
print(hist.history)
model.load_weights(best_weights_filepath)
evaluation = model.evaluate(X_train, y_train)
print(evaluation)

History:

{'loss': [0.88553774356842041, 1.3095510005950928, 1.0029082298278809, 0.93805015087127686, 0.91467124223709106, 1.2132010459899902, 1.0659240484237671, 0.70151412487030029, 1.1300414800643921, 0.94646221399307251, 0.85309064388275146, 0.79526293277740479, 0.70288115739822388, 1.1289818286895752, 0.87788408994674683, 0.63794469833374023, 0.92958927154541016, 0.63434022665023804, 0.26608449220657349, 1.133800745010376, 0.45052343606948853, 0.29425695538520813, 1.3438365459442139, 1.6920032501220703, 1.1263372898101807, 0.78767621517181396, 1.8708134889602661, 0.39164793491363525, 1.9281209707260132, 0.56522297859191895, 0.97685378789901733, 0.73725700378417969, 0.55782550573348999, 1.0230169296264648, 0.63401424884796143, 0.27007108926773071, 1.3010811805725098, 0.58272790908813477, 0.62068361043930054, 0.85791635513305664, 1.2364600896835327, 0.55607849359512329, 1.382312536239624, 1.0019338130950928, 0.24319441616535187, 0.76683026552200317, 0.99913954734802246, 0.57584917545318604, 0.78851628303527832, 1.8757588863372803]}

Evaluation of the saved model:

0.698137879372

I would like to know why the evaluation of the best saved model differs from the best loss in the training history.

Additional information:

I tried saving the epoch number and loss in the checkpoint filename with the following code:

mcp = ModelCheckpoint(filepath='./{epoch:d}_{loss:.5f}.hdf5', monitor="loss",
                      save_best_only=True)

and got these files:

0_1.71130.hdf5 2_0.39069.hdf5 17_0.25475.hdf5 20_0.15824.hdf5

which correspond to the training output (the epoch index in the filename is zero-based, so 20 matches Epoch 21/50):

Epoch 21/50 2/2 [==============================] - 0s - loss: 0.1582

But after loading the best model:

best_weights_filepath = "20_0.15824.hdf5"
model.load_weights(best_weights_filepath)
evaluation = model.evaluate(X_train, y_train)
print(evaluation)

the result is:

0.792584061623

Update based on Josef Korbel's suggestions:

Checking shuffle=False. I changed this line of code:

hist = model.fit(X_train, y_train, 32, 50, callbacks=[mcp], shuffle = False) 

History:

{'loss': [1.0125206708908081, 0.1452154815196991, 0.51181155443191528, 0.56420713663101196, 0.84724342823028564, 1.1929426193237305, 0.29997271299362183, 0.75090807676315308, 0.85906744003295898, 1.2877860069274902, 1.8168995380401611, 0.25087261199951172, 0.67293435335159302, 0.036234244704246521, 1.5076791048049927, 0.87120181322097778, 0.68330782651901245, 2.0751430988311768, 0.82240021228790283, 0.60692423582077026, 0.37373599410057068, 0.3232136070728302, 0.80889785289764404, 0.096551664173603058, 0.37592190504074097, 0.72723108530044556, 0.21966041624546051, 1.0940688848495483, 0.68471181392669678, 0.68382972478866577, 0.5214000940322876, 0.82752323150634766, 0.12418889999389648, 0.079014614224433899, 0.27435758709907532, 0.25825804471969604, 1.3681017160415649, 1.7907644510269165, 0.39580270648002625, 1.4243916273117065, 0.14836907386779785, 0.3069019615650177, 1.4323314428329468, 0.42189797759056091, 0.047193970531225204, 0.47303882241249084, 0.62194353342056274, 0.284626305103302, 1.8536494970321655, 0.73895668983459473]}

Evaluation of the saved model:

0.356727153063

Best checkpoint file:

43_0.19047.hdf5

Evaluation after loading this file:

0.373612910509

Checking validation

Validation code:

from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout
from keras.utils import np_utils
from keras.callbacks import ModelCheckpoint
import numpy as np

best_weights_filepath = './best_weights.hdf5'

train_labels = np.array([1, 2]) # 0 - num_classes - 1
y_train = np_utils.to_categorical(train_labels, 3)
X_train = np.array([[[1, 2], [3, 4]], [[2, 1], [4, 3]]])

test_labels = np.array([0, 2]) # 0 - num_classes - 1
y_test = np_utils.to_categorical(test_labels, 3)
X_test = np.array([[[2, 2], [3, 3]], [[1, 1], [4, 4]]])

model = Sequential()
model.add(Flatten(input_shape=X_train.shape[1:]))
model.add(Dropout(0.2))
model.add(Dense(64))
model.add(Dropout(0.15))
model.add(Dense(32))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd')
mcp = ModelCheckpoint(filepath=best_weights_filepath, monitor='val_loss',
                      save_best_only=True)
hist = model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=50, callbacks=[mcp], shuffle = False)
print(hist.history)
model.load_weights(best_weights_filepath)
evaluation = model.evaluate(X_test, y_test)
print(evaluation)
print(model.metrics_names)

History:

{'loss': [3.4101266860961914, 2.2727742195129395, 0.82779181003570557, 1.3179346323013306, 1.5904533863067627, 0.60796171426773071, 0.93778908252716064, 1.5920863151550293, 0.9363548755645752, 0.77552896738052368, 0.87378394603729248, 2.1034069061279297, 0.40709391236305237, 0.87646675109863281, 0.072320356965065002, 0.70467042922973633, 0.89934390783309937, 0.26884844899177551, 0.87511622905731201, 0.40567696094512939, 1.6750704050064087, 0.37005302309989929, 0.36293312907218933, 0.94361913204193115, 0.19056390225887299, 1.3764189481735229, 0.25876694917678833, 0.55998247861862183, 1.0649962425231934, 2.1643946170806885, 0.2727261483669281, 1.2005348205566406, 1.0628913640975952, 1.572542667388916, 0.22350168228149414, 0.37423995137214661, 0.7491459846496582, 0.51720428466796875, 0.86196297407150269, 0.72071665525436401, 0.7442132830619812, 0.83153235912322998, 0.045838892459869385, 0.037082117050886154, 0.68096923828125, 0.35572469234466553, 1.4226186275482178, 0.40259963274002075, 0.4162265956401825, 0.29243966937065125], 'val_loss': [2.0877130031585693, 1.3081772327423096, 1.0912094116210937, 1.4002015590667725, 1.1119445562362671, 1.2372562885284424, 1.4829056262969971, 1.3195570707321167, 1.6970505714416504, 1.8137892484664917, 2.6280913352966309, 1.6495449542999268, 1.9247033596038818, 1.8289017677307129, 1.9001308679580688, 1.7850335836410522, 1.903494119644165, 1.8801615238189697, 1.8557041883468628, 1.901431679725647, 2.1235334873199463, 2.1267158985137939, 2.1307065486907959, 2.3799698352813721, 2.6747565269470215, 2.5206508636474609, 2.3310909271240234, 2.6511917114257812, 2.4436931610107422, 2.560744047164917, 2.5082297325134277, 2.3821530342102051, 2.4538085460662842, 2.5820655822753906, 2.5825791358947754, 2.8093762397766113, 2.5358507633209229, 2.4986701011657715, 3.152174711227417, 2.7431669235229492, 2.841381311416626, 2.5363466739654541, 2.5489804744720459, 2.5466430187225342, 2.577369213104248, 2.679440975189209, 2.5890841484069824, 2.7041923999786377, 2.6547081470489502, 2.6690154075622559]}

Evaluation of the saved model:

1.09120941162

It looks like it works when monitoring the validation loss.

Checking the saved file:

4_2.19177.hdf5

Evaluation after loading this file:

2.19177055359

That is because you are monitoring loss, which is the loss on the training dataset. The loss on the validation dataset is called val_loss. I don't know whether this is the actual code you are running, but you shouldn't evaluate on the same dataset you trained on. The model may not generalize at all and may simply start memorizing the input data and overfitting, especially on a dataset this small.
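
To make this concrete: when ModelCheckpoint monitors val_loss and the restored model is evaluated on the held-out data, the two numbers line up, which is exactly what the "Checking validation" run above shows. A minimal sketch, assuming the variables (hist, model, best_weights_filepath, X_test, y_test) from that validation code:

best_val_loss = min(hist.history['val_loss'])         # best validation loss seen during training
model.load_weights(best_weights_filepath)             # restore the checkpoint saved at that epoch
restored_loss = model.evaluate(X_test, y_test, verbose=0)
print(best_val_loss, restored_loss)                    # the two should agree (the 1.0912... above)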

Why is the evaluation worse than the best saved training loss? The loss values in the history are recorded at the end of each epoch, and if you have shuffle=True the batches are fed in a different order every epoch, so the gradients are computed differently; this alone can produce different numbers. Evaluation, on the other hand, processes the whole set in one go (in chunks of batch_size). But again, don't evaluate on the same dataset you trained on, or you will have a hard time telling how accurate the network really is.
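
A minimal diagnostic sketch, assuming the toy model and data from the first code block, that makes the gap visible: it prints the loss stored in the history next to a fresh model.evaluate() pass at the end of every epoch (EvalAfterEpoch is a name made up here). The two numbers generally disagree, because the logged loss is accumulated over the epoch's batches while the weights are still moving and the Dropout layers are active, whereas evaluate() runs once with the end-of-epoch weights and Dropout turned off.

from keras.callbacks import Callback

class EvalAfterEpoch(Callback):
    """Diagnostic only: compare the loss Keras logged for the epoch with a
    fresh model.evaluate() pass over the same training data."""
    def on_epoch_end(self, epoch, logs={}):
        # logs['loss'] was averaged over the epoch's batches as they were processed;
        # evaluate() makes one clean pass with the current weights and Dropout disabled.
        clean_loss = self.model.evaluate(X_train, y_train, verbose=0)
        print('epoch %d: history loss %.5f vs evaluate() loss %.5f'
              % (epoch, logs['loss'], clean_loss))

hist = model.fit(X_train, y_train, 32, 50,
                 callbacks=[mcp, EvalAfterEpoch()], shuffle=False)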