TensorFlow Dataset: Order appears randomised when iterating via for loop?

Question

I am creating some batched TensorFlow datasets with tf.keras.preprocessing.image_dataset_from_directory:

import os
import tensorflow as tf

image_size = (90, 120)
batch_size = 32

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    os.path.join(model_split_dir,'train'),
    validation_split=0.25,
    subset="training",
    seed=1,
    image_size=image_size,
    batch_size=batch_size
)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    os.path.join(model_split_dir,'train'),
    validation_split=0.25,
    subset="validation",
    seed=1,
    image_size=image_size,
    batch_size=batch_size
)
test_ds = tf.keras.preprocessing.image_dataset_from_directory(
    os.path.join(model_split_dir,'test'),
    seed=1,
    image_size=image_size,
    batch_size=batch_size
)

If I then use the following for loop to get the image and label information from one of the datasets, I get different output every time I run it:

for images, labels in test_ds:
  print(labels)

For example, on one run the first batch looks like this:

tf.Tensor([0 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1], shape=(32,), dtype=int32)

But running the loop again gives a completely different result:

tf.Tensor([1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 0 0], shape=(32,), dtype=int32)

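Why is the order different on every loop? Are TensorFlow datasets unordered? From what I have found, they should be ordered, so I do not understand why the for loop returns the labels in a different order each time.

Any insight into this would be greatly appreciated.

Update: The shuffling of the dataset order is working as intended. For my test data, I just needed to set shuffle to False. Many thanks @AloneTogether!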

Answer 1

The shuffle parameter of tf.keras.preprocessing.image_dataset_from_directory is set to True by default. If you want deterministic results, try setting it to False:

import tensorflow as tf
import pathlib

dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)

train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="training",
  image_size=(28, 28),
  batch_size=5,
  shuffle=False)

for x, y in train_ds:
  print(y)
  break
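To check the shuffle=False behaviour without downloading anything, here is a minimal sketch. It assumes a recent TensorFlow 2.x where tf.keras.utils.image_dataset_from_directory is available. The temporary directory, the class names class_a and class_b, and the helper collect_labels are made up for illustration and are not part of the original question or answer.

```python
import os
import tempfile

import tensorflow as tf

# Build a tiny throwaway image directory (hypothetical class names) so the
# example runs without any download.
root = tempfile.mkdtemp()
for class_name in ("class_a", "class_b"):
    os.makedirs(os.path.join(root, class_name))
    for i in range(3):
        image = tf.zeros((8, 8, 3), dtype=tf.uint8)
        path = os.path.join(root, class_name, f"img_{i}.png")
        tf.io.write_file(path, tf.io.encode_png(image))

# shuffle=False: the files are read in a fixed order on every pass.
demo_ds = tf.keras.utils.image_dataset_from_directory(
    root,
    image_size=(8, 8),
    batch_size=4,
    shuffle=False,
)


def collect_labels(dataset):
    """Flatten the labels of every batch from one full pass into a list."""
    return [int(label) for _, labels in dataset for label in labels.numpy()]


first_pass = collect_labels(demo_ds)
second_pass = collect_labels(demo_ds)

print(first_pass)
print(first_pass == second_pass)
```

If the two passes compare equal, the labels can safely be lined up with anything else computed in the same order, which is typically what you want for a test set.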

On the other hand, this will always produce random results:

train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  seed=None,
  image_size=(28, 28),
  batch_size=5,
  shuffle=True)

for x, y in train_ds:
  print(y)
  break

If you set a random seed together with shuffle=True, the dataset will be shuffled once, but you will get deterministic results:

train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  seed=123,
  image_size=(28, 28),
  batch_size=5,
  shuffle=True)

for x, y in train_ds:
  print(y)
  break
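The same effect can be explored with plain tf.data, without any images. The sketch below is an illustration and not the internals of image_dataset_from_directory. It assumes that a tf.data-style shuffle is what makes the order change between loops, and it uses integers 0 to 19 as stand-ins for labels. The helper label_order and the variable names are hypothetical.

```python
import tensorflow as tf


def label_order(dataset):
    """Collect every element from one full pass over a batched dataset."""
    return [int(label) for batch in dataset for label in batch.numpy()]


# Stand-in for image labels: the integers 0..19.
labels = tf.data.Dataset.range(20)

# No shuffle: every pass yields 0..19 in order.
unshuffled = labels.batch(5)

# Shuffle that is redone on every pass (the tf.data default behaviour).
reshuffled = labels.shuffle(
    20, seed=123, reshuffle_each_iteration=True
).batch(5)

# Shuffle that keeps the same shuffled order on every pass.
fixed = labels.shuffle(
    20, seed=123, reshuffle_each_iteration=False
).batch(5)

unshuffled_pass_1 = label_order(unshuffled)
unshuffled_pass_2 = label_order(unshuffled)
reshuffled_pass_1 = label_order(reshuffled)
reshuffled_pass_2 = label_order(reshuffled)
fixed_pass_1 = label_order(fixed)
fixed_pass_2 = label_order(fixed)

print("unshuffled:", unshuffled_pass_1 == unshuffled_pass_2)
print("reshuffled:", reshuffled_pass_1 == reshuffled_pass_2)
print("fixed:     ", fixed_pass_1 == fixed_pass_2)
```

Comparing the printed flags shows which configurations repeat their order from one for loop to the next. For evaluation data, the unshuffled variant is usually the simplest choice.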

