Keras ImageDataGenerator validation_split does not split validation data as expected
I am trying to learn computer vision for machine learning with Tensorflow and Keras.
I have a directory containing 4185 images that I got from https://www.kaggle.com/datasets/smaranjitghose/corn-or-maize-leaf-disease-dataset (I deliberately deleted 3 images).
I have this code using listdir() to check that this is correct:
import os

folders = os.listdir('/tmp/datasets/data')
print(f'folders: {folders}')

total_images = 0
for f in folders:
    total_images += len(os.listdir(f'/tmp/datasets/data/{f}'))

print(f'Total Images found: {total_images}')
The output is as follows:
folders: ['Blight', 'Common_Rust', 'Gray_Leaf_Spot', 'Healthy']
Total Images found: 4185
I want to split it into an 80% training set and a 20% validation set using Keras' ImageDataGenerator:
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale=1./255,
    fill_mode='nearest',
    width_shift_range=0.05,
    height_shift_range=0.05,
    rotation_range=45,
    shear_range=0.1,
    zoom_range=0.2,
    horizontal_flip=True,
    vertical_flip=True,
    validation_split=0.2,
)

val_datagen = ImageDataGenerator(
    rescale=1./255,
    validation_split=0.2
)

train_images = datagen.flow_from_directory('/tmp/datasets/data',
                                           target_size=(150, 150),
                                           batch_size=32,
                                           seed=42,
                                           subset='training',
                                           class_mode='categorical'
                                           )

val_images = val_datagen.flow_from_directory('/tmp/datasets/data',
                                             target_size=(150, 150),
                                             batch_size=32,
                                             seed=42,
                                             subset='validation',
                                             class_mode='categorical'
                                             )
Here is the output logged by flow_from_directory():
Found 3350 images belonging to 4 classes.
Found 835 images belonging to 4 classes.
The resulting split is not the expected 3348 | 837 (0.2 * 4185 = 837). Am I missing something, or have I misunderstood the validation_split parameter?
The data is split per folder (class), not over the whole dataset. Check the source code here and here for more information. Here is an example of what flow_from_directory does internally:
import os

folders = os.listdir('/content/data')
print(f'folders: {folders}')

total_images = 0
names = []
paths = []
white_list_formats = ('png', 'jpg', 'jpeg', 'bmp', 'ppm', 'tif', 'tiff')

for f in folders:
    paths.append(os.listdir(f'/content/data/{f}'))
    for d in os.listdir(f'/content/data/{f}'):
        if d.lower().endswith(white_list_formats):
            names.append(d)

print(f'Total number of valid images found: {len(names)}')
folders: ['Blight', 'Healthy', 'Common_Rust', 'Gray_Leaf_Spot']
Total number of valid images found: 4188
Splitting the data per folder:
training_samples = 0
for p in paths:
    split = (0.2, 1)
    num_files = len(p)
    start, stop = int(split[0] * num_files), int(split[1] * num_files)
    valid_files = p[start: stop]
    training_samples += len(valid_files)

print(training_samples)

validation_samples = 0
for p in paths:
    split = (0, 0.2)
    num_files = len(p)
    start, stop = int(split[0] * num_files), int(split[1] * num_files)
    valid_files = p[start: stop]
    validation_samples += len(valid_files)

print(validation_samples)
3352
836
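To make the per-class flooring concrete, here is a small worked example. The per-class counts below are an assumption on my part (they are consistent with the 3352 / 836 totals above, but your exact per-folder numbers may differ slightly):

# Assumed per-class counts for the full 4188-image dataset (illustrative only)
class_counts = {'Blight': 1146, 'Common_Rust': 1306, 'Gray_Leaf_Spot': 574, 'Healthy': 1162}

# validation_split floors int(0.2 * n) separately for every class...
validation = sum(int(0.2 * n) for n in class_counts.values())
training = sum(n - int(0.2 * n) for n in class_counts.values())
print(training, validation)  # 3352 836

# ...whereas a single split over the whole dataset would floor only once
total = sum(class_counts.values())  # 4188
print(total - int(0.2 * total), int(0.2 * total))  # 3351 837

Because the 20% is floored once per class, the validation set can come out a few images smaller than 0.2 * total (at most one image per class).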
The 3352 / 836 split corresponds to what you see from flow_from_directory:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale=1./255,
    fill_mode='nearest',
    width_shift_range=0.05,
    height_shift_range=0.05,
    rotation_range=45,
    shear_range=0.1,
    zoom_range=0.2,
    horizontal_flip=True,
    vertical_flip=True,
    validation_split=0.2,
)

val_datagen = ImageDataGenerator(
    rescale=1./255,
    validation_split=0.2
)

train_images = datagen.flow_from_directory('/content/data',
                                           target_size=(150, 150),
                                           batch_size=32,
                                           seed=42,
                                           subset='training',
                                           shuffle=False,
                                           class_mode='categorical'
                                           )

val_images = val_datagen.flow_from_directory('/content/data',
                                             target_size=(150, 150),
                                             batch_size=32,
                                             seed=42,
                                             subset='validation',
                                             shuffle=False,
                                             class_mode='categorical'
                                             )
Found 3352 images belonging to 4 classes.
Found 836 images belonging to 4 classes.
Note that I did not delete 3 images like you did, but the logic stays the same.
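If you actually want the 80/20 split taken over the whole dataset (which would give the 3348 | 837 you expected), one option is tf.keras.utils.image_dataset_from_directory, which, as far as I can tell from its source, splits the full file list rather than each class folder. This is only a sketch under that assumption; note that it does not apply any of the ImageDataGenerator augmentations shown above:

import tensorflow as tf

# Sketch: split over the entire (seeded, shuffled) file list instead of per class folder.
# Rescaling/augmentation is NOT applied here; it would have to be added separately,
# e.g. with tf.keras.layers.Rescaling and RandomFlip/RandomRotation layers in the model.
train_ds = tf.keras.utils.image_dataset_from_directory(
    '/tmp/datasets/data',
    validation_split=0.2,
    subset='training',
    seed=42,
    image_size=(150, 150),
    batch_size=32,
    label_mode='categorical',
)
val_ds = tf.keras.utils.image_dataset_from_directory(
    '/tmp/datasets/data',
    validation_split=0.2,
    subset='validation',
    seed=42,
    image_size=(150, 150),
    batch_size=32,
    label_mode='categorical',
)
# With 4185 images this should report 3348 and 837 files respectively.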