Keras: Shuffling data using model.fit() doesn't make a change but sklearn.train_test_split() does

Question

I am new to Keras and have run into a problem that I don't understand and for which I have not found a solution online so far.

I am training a simple model on the UrbanSound8K dataset with the following code:

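from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense, Dropout

# load_data is a helper defined elsewhere (not shown).
x_train, y_train, _, _ = load_data(["data_1.pickle", "data_5.pickle"])
# Uncommenting the next line shuffles the data once before fitting.
#x_train, _, y_train, _ = train_test_split(x_train, y_train, test_size=0.01, random_state=0, shuffle=True)

model = Sequential()

# Two hidden layers of 256 units with ReLU and dropout, on 40 input features.
model.add(Dense(256, input_shape=(40,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.5))

# Softmax output over the 10 classes.
model.add(Dense(10))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')

model.fit(x_train, y_train, validation_split=0.2, batch_size=32, epochs=50, shuffle=True)

When I train this model, it reaches a val_accuracy of about 50%. Changing shuffle to False in model.fit() seems to have no effect. But when I uncomment the train_test_split line and shuffle the dataset with x_train, _, y_train, _ = train_test_split(x_train, y_train, test_size=0.01, random_state=0, shuffle=True), the model reaches a val_accuracy of more than 80%, regardless of whether shuffle in model.fit() is set to True or False.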

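How is this possible? Shuffling the data before fitting the model shouldn't make any difference, since the training data is shuffled before each epoch anyway, should it? Or have I misunderstood the shuffle parameter of model.fit()? Or is there some other magic going on in train_test_split()?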

Answer 1

You are using a validation split of 0.2. The model.fit documentation states:

 The validation data is selected from the last samples in the x and y data provided, before shuffling.
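The order described in that sentence can be sketched in plain Python. This is only a toy model of the documented behavior, not Keras code: simulate_fit_split is a hypothetical helper, and the integers stand in for samples in their stored order.

```python
import random


def simulate_fit_split(samples, validation_split, shuffle, seed=0):
    """Toy model of the documented order: hold out the tail, then shuffle the rest."""
    split_at = int(len(samples) * (1.0 - validation_split))
    train, val = list(samples[:split_at]), list(samples[split_at:])
    if shuffle:
        # Only the training portion is reordered; the held-out tail is untouched.
        random.Random(seed).shuffle(train)
    return train, val


samples = list(range(10))  # stand-ins for 10 samples in stored order

train_a, val_a = simulate_fit_split(samples, 0.2, shuffle=True)
train_b, val_b = simulate_fit_split(samples, 0.2, shuffle=False)

print(val_a)                       # [8, 9]
print(val_b)                       # [8, 9]
print(sorted(train_a) == train_b)  # True: same training samples, different order
```

In this sketch the validation samples are identical for both shuffle settings, which matches the observation that toggling shuffle leaves val_accuracy unchanged.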

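So the only explanation I can think of is this: when you don't use train_test_split, the validation data used by model.fit is always the same data, taken from the end of the unshuffled training data. The shuffle argument only reorders the training portion that remains after this split, which is why toggling it has no effect on val_accuracy. When you use train_test_split, the training data is shuffled first, so the validation data is different in that case. If the validation set is small, this can make a significant difference to the computed validation accuracy, because the validation samples differ between the two cases.

I think it is bad practice for model.fit to select the validation data from the end of the training data. It should select it at random from the training data. Even with a fairly large number of validation samples, the validation accuracy can be much lower if the data at the end of the training set has a noticeably different probability distribution from the rest of the training data. For example, if you are classifying dogs versus cats and all of the last images in the training set are cats, then the validation images will all be cats.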
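The cats-and-dogs scenario can be made concrete with a small, self-contained sketch. The labels below are hypothetical (10 classes stored grouped by class, 100 samples each) and are not taken from UrbanSound8K; hold_out_tail is an illustrative helper that mimics taking the last 20% as validation data.

```python
import random
from collections import Counter

# Hypothetical labels stored grouped by class: 10 classes, 100 samples each.
labels = [cls for cls in range(10) for _ in range(100)]


def hold_out_tail(items, fraction):
    """Return (train, val), where val is the last `fraction` of items."""
    split_at = int(len(items) * (1.0 - fraction))
    return items[:split_at], items[split_at:]


# Stored order: the tail contains only the last classes.
train_ordered, val_ordered = hold_out_tail(labels, 0.2)

# Shuffle once up front (as train_test_split does), then hold out the tail.
shuffled = labels[:]
random.Random(0).shuffle(shuffled)
train_shuffled, val_shuffled = hold_out_tail(shuffled, 0.2)

print(sorted(Counter(val_ordered)))    # [8, 9]
print(sorted(Counter(train_ordered)))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(len(Counter(val_shuffled)))      # number of distinct classes after shuffling
```

With the stored order, the model is validated only on classes it never saw during training. After a single up-front shuffle the held-out tail is a mix of classes. In real code, one way to avoid relying on the tail is to build the hold-out set yourself and pass it to model.fit through its validation_data argument.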

