Keras ImageDataGenerator validation_split does not split validation data as expected

I am trying to learn computer vision for machine learning using TensorFlow and Keras.

I have a directory containing 4185 images that I got from https://www.kaggle.com/datasets/smaranjitghose/corn-or-maize-leaf-disease-dataset (I deliberately removed 3 images).

I have this code using listdir() to check that the directory is correct:

import os
folders = os.listdir('/tmp/datasets/data')
print(f'folders: {folders}')

total_images = 0
for f in folders:
  total_images += len(os.listdir(f'/tmp/datasets/data/{f}'))

print(f'Total Images found: {total_images}')

The output is as follows:

folders: ['Blight', 'Common_Rust', 'Gray_Leaf_Spot', 'Healthy']
Total Images found: 4185
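
A slightly extended version of the same check (just a sketch against the same directory) also prints the number of files per folder:

import os

base = '/tmp/datasets/data'
for f in sorted(os.listdir(base)):
  # count the files directly inside each class folder
  print(f'{f}: {len(os.listdir(os.path.join(base, f)))}')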

I want to split it into an 80% training set and a 20% validation set with Keras' ImageDataGenerator:

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale = 1./255,
    fill_mode='nearest',
    width_shift_range = 0.05,
    height_shift_range = 0.05,
    rotation_range = 45,
    shear_range = 0.1,
    zoom_range=0.2,
    horizontal_flip = True,
    vertical_flip = True,
    validation_split = 0.2,
)

val_datagen = ImageDataGenerator(
    rescale = 1./255,
    validation_split = 0.2
)

train_images = datagen.flow_from_directory('/tmp/datasets/data',
    target_size=(150,150),
    batch_size=32,
    seed=42,
    subset='training',
    class_mode='categorical'
)

val_images = val_datagen.flow_from_directory('/tmp/datasets/data',
    target_size=(150,150), 
    batch_size=32, 
    seed=42,
    subset='validation', 
    class_mode='categorical'
)

Here is the output logged by flow_from_directory():

Found 3350 images belonging to 4 classes.
Found 835 images belonging to 4 classes.

The resulting split is not the expected 3348 | 837 (0.2 * 4185 = 837). Am I missing something, or am I misunderstanding the validation_split parameter?

The data is split per folder (class), not across the whole dataset. See the source code here and here for more details. Here is an example of what flow_from_directory does internally:

import os

folders = os.listdir('/content/data')
print(f'folders: {folders}')

total_images = 0
names = []
paths = [] 
white_list_formats = ('png', 'jpg', 'jpeg', 'bmp', 'ppm', 'tif', 'tiff')  # extensions accepted by flow_from_directory
for f in folders:
  paths.append(os.listdir(f'/content/data/{f}'))  # file list per class folder
  for d in os.listdir(f'/content/data/{f}'):
    if d.lower().endswith(white_list_formats):
      names.append(d)

print(f'Total number of valid images found: {len(names)}')

folders: ['Blight', 'Healthy', 'Common_Rust', 'Gray_Leaf_Spot']
Total number of valid images found: 4188

Splitting the data per folder:

training_samples = 0
for p in paths:
  split = (0.2, 1)  # the 'training' subset keeps files from index int(0.2 * n) to n
  num_files = len(p)
  start, stop = int(split[0] * num_files), int(split[1] * num_files)
  valid_files = p[start: stop]
  training_samples += len(valid_files)
print(training_samples)

validation_samples = 0
for p in paths:
  split = (0, 0.2)  # the 'validation' subset keeps files from index 0 to int(0.2 * n)
  num_files = len(p)
  start, stop = int(split[0] * num_files), int(split[1] * num_files)
  valid_files = p[start: stop]
  validation_samples += len(valid_files)
print(validation_samples)

3352
836
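
In other words, the int() floor is applied per folder, so each class drops its own fractional remainder instead of the whole dataset dropping it once. A minimal arithmetic sketch, assuming the per-class counts of the Kaggle dataset (these exact numbers are an assumption; they simply sum to 4188):

# Assumed per-class counts for the Kaggle corn/maize dataset (1146 + 1306 + 574 + 1162 = 4188)
class_counts = {'Blight': 1146, 'Common_Rust': 1306, 'Gray_Leaf_Spot': 574, 'Healthy': 1162}

val_total = sum(int(0.2 * n) for n in class_counts.values())        # floor applied per class
train_total = sum(n - int(0.2 * n) for n in class_counts.values())

print(val_total, train_total)                  # 836 3352
print(int(0.2 * sum(class_counts.values())))   # 837 -> a single global floor would split differently

That is why the totals land at 836 / 3352 rather than the single global floor of 837 / 3351.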

This corresponds to what you see from flow_from_directory:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale = 1./255,
    fill_mode='nearest',
    width_shift_range = 0.05,
    height_shift_range = 0.05,
    rotation_range = 45,
    shear_range = 0.1,
    zoom_range=0.2,
    horizontal_flip = True,
    vertical_flip = True,
    validation_split = 0.2,
)

val_datagen = ImageDataGenerator(
    rescale = 1./255,
    validation_split = 0.2
)

train_images = datagen.flow_from_directory('/content/data',
    target_size=(150,150),
    batch_size=32,
    seed=42,
    subset='training',
    shuffle=False,
    class_mode='categorical'
)

val_images = val_datagen.flow_from_directory('/content/data',
    target_size=(150,150), 
    batch_size=32, 
    seed=42,
    subset='validation', 
    shuffle=False,
    class_mode='categorical'
)

Found 3352 images belonging to 4 classes.
Found 836 images belonging to 4 classes.

Note that, unlike you, I did not remove the 3 images, but the logic stays the same.
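
As a quick sanity check (a sketch reusing the train_images / val_images iterators defined above), the two subsets should be disjoint and together cover all valid images:

# filenames holds the relative paths selected for each subset
train_files = set(train_images.filenames)
val_files = set(val_images.filenames)

print(len(train_files & val_files))       # expected: 0 (no overlap between subsets)
print(len(train_files) + len(val_files))  # expected: 4188 (every valid image is used exactly once)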