How can I split the dataset obtained from image_dataset_from_directory into data and labels?

Question

I am trying to build a CNN in TensorFlow using Python. I have loaded my images into a dataset as follows:

dataset = tf.keras.preprocessing.image_dataset_from_directory(
    "train_data", shuffle=True, image_size=(578, 260),
    batch_size=BATCH_SIZE)

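However, if I want to use train_test_split or fit_resample on this dataset, I need to split it into data and labels. I am new to TensorFlow and don't know how to do this. Any help is much appreciated.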

Answer 1

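You can use the validation_split and subset parameters to split your data into training and validation sets. Use the same seed in both calls so the two subsets do not overlap.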

import tensorflow as tf
import pathlib

dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)


train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="training",
  image_size=(256, 256),
  seed=1,
  batch_size=32)

val_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="validation",
  seed=1,
  image_size=(256, 256),
  batch_size=32)

for x, y in train_ds.take(1):
  print('Image --> ', x.shape, 'Label --> ',  y.shape)

Found 3670 files belonging to 5 classes.
Using 2936 files for training.
Found 3670 files belonging to 5 classes.
Using 734 files for validation.
Image -->  (32, 256, 256, 3) Label -->  (32,)

As for your labels, according to the docs for the labels parameter:

Either "inferred" (labels are generated from the directory structure), None (no labels), or a list/tuple of integer labels of the same size as the number of image files found in the directory. Labels should be sorted according to the alphanumeric order of the image file paths (obtained via os.walk(directory) in Python).

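So just iterate over train_ds and check that the labels are there; each element is an (images, labels) pair, as the loop above shows. You can also use the label_mode parameter to specify the type of labels you have, and class_names to list your classes explicitly.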
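To get plain arrays for tools such as train_test_split or fit_resample, one option is to iterate over the batched dataset once and concatenate the batches. The following is a minimal sketch, not part of the original answer: the helper name dataset_to_arrays is hypothetical, the demo batches are stand-in data, and it assumes the whole dataset fits in memory.

```python
import numpy as np


def dataset_to_arrays(batches):
    """Concatenate an iterable of (images, labels) batches into two arrays.

    `batches` is assumed to yield pairs of NumPy arrays, for example
    `train_ds.as_numpy_iterator()` on a batched tf.data.Dataset.
    """
    images, labels = [], []
    for x, y in batches:
        images.append(np.asarray(x))
        labels.append(np.asarray(y))
    # Batches may differ in size (the last one is often smaller),
    # so join along the batch axis instead of stacking.
    return np.concatenate(images, axis=0), np.concatenate(labels, axis=0)


# Stand-in data: two batches of 4 and 2 small images with integer labels.
demo_batches = [
    (np.zeros((4, 8, 8, 3), dtype=np.float32), np.array([0, 1, 2, 1])),
    (np.ones((2, 8, 8, 3), dtype=np.float32), np.array([3, 4])),
]
X, y = dataset_to_arrays(demo_batches)
print(X.shape, y.shape)
```

With the dataset from the question, this would be called as X, y = dataset_to_arrays(dataset.as_numpy_iterator()). Collecting images and labels in the same pass matters when the dataset was created with shuffle=True, because a second iteration may return the elements in a different order.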

If your classes are imbalanced, you can use the class_weight parameter of model.fit(*). For details, see this post.
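As a rough sketch of the class-weighting idea (not from the original answer), the weights below follow the common "balanced" heuristic n_samples / (n_classes * count). The helper name balanced_class_weights and the demo label vector are hypothetical, and integer labels are assumed, which is what the default label_mode produces.

```python
import numpy as np


def balanced_class_weights(labels):
    """Return {class_index: weight}, with rarer classes weighted higher."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    # n_samples / (n_classes * count_of_class)
    weights = labels.size / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}


# Hypothetical label vector: class 0 appears six times, class 1 twice.
demo_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1])
class_weight = balanced_class_weights(demo_labels)
print(class_weight)
```

The resulting dictionary maps each class index to a weight and could then be passed along the lines of model.fit(train_ds, class_weight=class_weight).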
