How to use K-Fold Cross Validation For This CNN?

I have been trying to implement K-fold cross validation for my binary image classifier, but I've been struggling for a while because I keep getting stuck on the data-handling side of things. I've included my code below (it's long and messy, sorry) as it stood before my K-fold attempt, which produced horrible errors. Any advice or support would be greatly appreciated. I believe K-fold is the right approach here, but if it isn't, please let me know. Thanks a lot!

I'd like to know how to reformat my data to create the separate folds, because almost every tutorial uses a .csv file; I instead just have two different folders of images, either split into two separate class subfolders (for the training data) or in a single folder (for the test data).
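As a sketch of what the fold bookkeeping can look like without a .csv file, one option is to collect the file paths per class folder and let scikit-learn's `KFold` index into them (the paths and counts below are made up for illustration; in practice they would come from `glob('train/cats/*.jpg')` and so on):

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical file lists standing in for the two class folders.
cat_files = [f'train/cats/cat{i}.jpg' for i in range(10)]
dog_files = [f'train/dogs/dog{i}.jpg' for i in range(10)]

files = np.array(cat_files + dog_files)
labels = np.array([0] * len(cat_files) + [1] * len(dog_files))

kf = KFold(n_splits=5, shuffle=True, random_state=42)
folds = []
for train_idx, val_idx in kf.split(files):
    # Each fold is a (train_files, val_files) pair of path arrays.
    folds.append((files[train_idx], files[val_idx]))

print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 5 16 4
```

Each `(train_files, val_files)` pair can then be fed to whatever loading mechanism you prefer, generator-based or not.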

from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense
from keras.layers import Dropout
from keras.regularizers import l2
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import EarlyStopping
import numpy as np
import matplotlib.pyplot as plt

classifier = Sequential()
classifier.add(Conv2D(32, (3 , 3), input_shape = (256, 256, 3), activation = 'relu', kernel_regularizer=l2(0.01)))
classifier.add(MaxPooling2D(pool_size=(2,2)))
classifier.add(Flatten())
classifier.add(Dense(units = 128, activation='relu'))
classifier.add(Dropout(0.5))
classifier.add(Dense(units=1, activation='sigmoid'))
classifier.compile(optimizer='adam', loss='binary_crossentropy',metrics=['accuracy'])

train_datagen = ImageDataGenerator(rescale=1./255, shear_range=0.2, zoom_range=0.2, horizontal_flip=True, validation_split=0.2)
test_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
    'train',
    target_size=(256, 256),
    batch_size=32,
    class_mode='binary',
    subset='training') # set as training data

validation_generator = train_datagen.flow_from_directory(
    'train', # same directory as training data
    target_size=(256, 256),
    batch_size=32,
    class_mode='binary',
    subset='validation')
test_set = test_datagen.flow_from_directory('test', target_size = (256,256), batch_size=10, class_mode='binary')

history = classifier.fit(train_generator, steps_per_epoch=40, epochs=100, validation_data=validation_generator)  # fit_generator is deprecated; fit accepts generators
classifier.save('50epochmodel')

test_images, _ = next(test_set)  # grab one batch of test images
probabilities = classifier.predict(test_images)

There is no magic here; just write a wrapper around the generator, or use this workaround.

All in all, I suggest you create a csv file with the image name in the first column and the label in the second.
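A minimal sketch of building such a file from a class-per-subfolder layout (the function name, the `filename`/`label` column names, and the demo directory tree are assumptions for illustration, not part of the original answer):

```python
import os
import tempfile
import pandas as pd

def folder_to_dataframe(root):
    """Walk root/<class_name>/* and return a filename/label table."""
    rows = []
    for class_name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, class_name)
        if not os.path.isdir(class_dir):
            continue
        for fname in sorted(os.listdir(class_dir)):
            rows.append({'filename': os.path.join(class_name, fname),
                         'label': class_name})
    return pd.DataFrame(rows)

# Demo on a throwaway directory tree so the sketch is self-contained.
root = tempfile.mkdtemp()
for cls in ('cats', 'dogs'):
    os.makedirs(os.path.join(root, cls))
    for i in range(3):
        open(os.path.join(root, cls, f'{cls}{i}.jpg'), 'w').close()

df = folder_to_dataframe(root)
df.to_csv(os.path.join(root, 'training_labels.csv'), index=False)
print(df)
```

The resulting csv is exactly what `flow_from_dataframe` (or `pd.read_csv` plus `KFold`) expects.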

After that:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from keras.preprocessing.image import ImageDataGenerator

train_data = pd.read_csv('training_labels.csv')
Y = train_data['label']
n = len(train_data)
image_dir = 'train'  # directory holding the image files

kf = KFold(n_splits=5, shuffle=True)       # the KFold object must be created first
idg = ImageDataGenerator(rescale=1./255)

for train_index, val_index in kf.split(np.zeros(n), Y):
    training_data = train_data.iloc[train_index]
    validation_data = train_data.iloc[val_index]
    train_data_generator = idg.flow_from_dataframe(training_data, directory = image_dir,
                               x_col = "filename", y_col = "label",
                               class_mode = "categorical", shuffle = True)
    valid_data_generator  = idg.flow_from_dataframe(validation_data, directory = image_dir,
                            x_col = "filename", y_col = "label",
                            class_mode = "categorical", shuffle = True)
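Inside that loop, a fresh model should be built for every fold so that weights do not leak between folds. A sketch of that bookkeeping pattern, with placeholder build/score functions standing in for the Keras `Sequential`/`compile`/`fit`/`evaluate` calls:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20)            # stand-in sample indices
y = np.tile([0, 1], 10)      # stand-in binary labels

def build_model():
    # Placeholder for the Sequential/compile code; returns a fresh "model".
    return {'trained_on': None}

def fit_and_score(model, train_idx, val_idx):
    # Placeholder for fit()/evaluate(); records split size, returns a dummy score.
    model['trained_on'] = len(train_idx)
    return len(val_idx) / len(X)

scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X, y):
    model = build_model()          # re-initialise weights each fold
    scores.append(fit_and_score(model, train_idx, val_idx))

print(np.mean(scores))  # average validation score over the 5 folds
```

The single number to report at the end is the mean (and usually the standard deviation) of the per-fold validation scores.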

For more flexibility, you can use a simple file-loading function instead of the Keras generators. You can then iterate over a list of files, training on some folds and testing against the remaining fold.

import os
os.chdir(r'catsanddogs')
import tensorflow as tf
from tensorflow.keras.layers import *
from tensorflow.keras import Sequential
from collections import deque
from glob2 import glob
import numpy as np

files = glob(os.path.join('*', '*', '*.jpg'))  # portable instead of a Windows-only '*\*\*.jpg'
remainder = len(files) % 3
if remainder:                   # guard: files[:-0] would empty the list
    files = files[:-remainder]  # dataset is now divisible by 3

indices = np.random.permutation(len(files)).reshape(3, -1)

imsize = 64


def load(file_path):
    img = tf.io.read_file(file_path)
    img = tf.image.decode_jpeg(img, channels=3)  # the files are .jpg, so decode as JPEG
    img = tf.image.convert_image_dtype(img, tf.float32)
    img = tf.image.resize(img, size=(imsize, imsize))
    label = tf.strings.split(file_path, os.sep)[1]
    label = tf.cast(tf.equal(label, 'dogs'), tf.int32)
    return img, label

accuracies_on_test_set = {}

for i in range(len(indices)):
    d = deque(np.array(files)[indices].tolist())
    d.rotate(-i)
    train1, train2, test1 = d
    train_ds = tf.data.Dataset.from_tensor_slices(train1 + train2).\
        shuffle(len(train1) + len(train2)).map(load).batch(4)
    test_ds = tf.data.Dataset.from_tensor_slices(test1).\
        shuffle(len(test1)).map(load).batch(4)

    classifier = Sequential()
    classifier.add(Conv2D(8, (3, 3), input_shape=(imsize, imsize, 3), activation='relu'))
    classifier.add(MaxPooling2D(pool_size=(2, 2)))
    classifier.add(Flatten())
    classifier.add(Dense(units=32, activation='relu'))
    classifier.add(Dropout(0.5))
    classifier.add(Dense(units=1, activation='sigmoid'))
    classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    classifier.fit(train_ds, validation_data=test_ds, epochs=2, verbose=0)
    loss, accuracy = classifier.evaluate(test_ds, verbose=0)
    accuracies_on_test_set[f'fold_{i + 1}_accuracy'] = accuracy  # one entry per fold, not per epoch

print(accuracies_on_test_set)
{'fold_1_accuracy': 0.8235, 'fold_2_accuracy': 0.7765, 'fold_3_accuracy': 0.736}

Here is the rotation of the datasets:

from collections import deque

groups = ['group1', 'group2', 'group3']

for i in range(3):
    d = deque(groups)
    d.rotate(-i)
    print(list(d))
['group1', 'group2', 'group3']
['group2', 'group3', 'group1']
['group3', 'group1', 'group2']

Each group takes its turn being last, and the last group is then used as the test set against all the others:

train1, train2, test1 = d
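The same rotation generalizes to any number of folds; a small stdlib-only sketch collecting every (train, test) split for K = 4 (the group names are illustrative):

```python
from collections import deque

groups = ['group1', 'group2', 'group3', 'group4']  # K = 4 folds

splits = []
for i in range(len(groups)):
    d = deque(groups)
    d.rotate(-i)
    *train_groups, test_group = d   # last rotated group is the test fold
    splits.append((train_groups, test_group))

for train_groups, test_group in splits:
    print(train_groups, '->', test_group)
```

Every group appears exactly once as the test fold, which is the invariant K-fold cross validation relies on.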