tf.data.Dataset: how to get the dataset size (number of elements in an epoch)?

Question

Let's say I have defined a dataset in this way:

filename_dataset = tf.data.Dataset.list_files("{}/*.png".format(dataset))

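How can I get the number of elements in the dataset (that is, the number of individual elements that make up one epoch)?

I know that tf.data.Dataset already knows the size of the dataset, because the repeat() method allows repeating the input pipeline for a specified number of epochs. So there must be a way to get this information.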

Answer 1

tf.data.Dataset.list_files creates a tensor called MatchingFiles:0 (with the appropriate prefix, if applicable).

You can evaluate

tf.shape(tf.get_default_graph().get_tensor_by_name('MatchingFiles:0'))[0]

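to get the number of files.

Of course, this only works in simple cases, in particular if you have exactly one sample (or a known number of samples) per image.

In more complex situations, e.g. when you do not know the number of samples in each file, you can only observe the number of samples when an epoch ends.

To do this, you can watch the number of epochs counted by your Dataset. repeat() creates a member called _count that counts the number of epochs. By observing it during iteration, you can spot when it changes and compute the dataset size from there.

This counter may be buried in the hierarchy of Datasets that is created when member functions are called successively, so we have to dig it out like this.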

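import warnings
import tensorflow as tf

d = my_dataset
# RepeatDataset does not seem to be exposed publicly, so recover its type
# from a throwaway instance (a possible workaround relying on TF internals).
RepeatDataset = type(tf.data.Dataset.range(1).repeat())
try:
  # Walk up the chain of wrapped datasets until the repeat() stage is found.
  while not isinstance(d, RepeatDataset):
    d = d._input_dataset
except AttributeError:
  warnings.warn('no epoch counter found')
  epoch_counter = None
else:
  epoch_counter = d._count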

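Note that with this technique the computed dataset size is not exact, because the batch during which epoch_counter is incremented typically mixes samples from two successive epochs. So the result is only accurate to within one batch length.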
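To make the batch-length caveat concrete, here is a minimal pure-Python sketch that simulates the idea without TensorFlow. The helper name estimate_size_from_epoch_counter and the simulated stream are hypothetical and not part of any TensorFlow API. The sketch assumes a non-empty dataset and that the epoch counter is only checked once per batch.

```python
import itertools


def estimate_size_from_epoch_counter(dataset_size, batch_size):
    """Simulate a repeated, batched stream and estimate its size.

    The epoch number is only inspected after each full batch, mirroring the
    case where one batch can mix samples from two consecutive epochs.
    """
    # Each element carries the epoch it belongs to, like a repeat() counter.
    stream = ((epoch, i)
              for epoch in itertools.count()
              for i in range(dataset_size))
    samples_seen = 0
    while True:
        batch = list(itertools.islice(stream, batch_size))
        samples_seen += len(batch)
        last_epoch = batch[-1][0]
        if last_epoch > 0:
            # The counter changed somewhere inside this batch.
            return samples_seen


estimate = estimate_size_from_epoch_counter(dataset_size=10, batch_size=4)
print(estimate)  # 12: the true size (10) is overestimated by less than one batch
```

In this simulation the estimate is never below the true size and never exceeds it by more than one batch, which is the sense in which the result is accurate "to within one batch length".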

Answer 2

len(list(dataset)) works in eager mode, although that is obviously not a good general solution.

Answer 3

Unfortunately, I don't believe TF has a feature like that yet. With TF 2.0 and eager execution, however, you can simply iterate over the dataset:

num_elements = 0
for element in dataset:
    num_elements += 1

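This is the most memory-efficient way I can think of.

This really feels like a feature that should have been added a long time ago. Fingers crossed that a length feature gets added in a later version.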

Answer 4

Take a look here: https://github.com/tensorflow/tensorflow/issues/26966

It does not work for TFRecord datasets, but it works fine for other types.

TL;DR:

num_elements = tf.data.experimental.cardinality(dataset).numpy()
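When the size cannot be inferred but you already know it from elsewhere, tf.data.experimental.assert_cardinality lets you attach it to the pipeline. The following is a minimal sketch, assuming TF 2.3 or later with eager execution. The filtered range dataset is only a stand-in for a dataset of unknown size (such as a TFRecord dataset), and the value 5 is known here only because the example is constructed that way.

```python
import tensorflow as tf

# Stand-in for a dataset whose size tf.data cannot infer statically.
dataset = tf.data.Dataset.range(10).filter(lambda x: x % 2 == 0)
unknown = tf.data.experimental.cardinality(dataset).numpy()

# If the size is known from elsewhere (here: 5 even numbers in 0..9),
# attach it so that later cardinality queries return it.
sized = dataset.apply(tf.data.experimental.assert_cardinality(5))
known = tf.data.experimental.cardinality(sized).numpy()

print(unknown, known)
```

Note that the asserted value is checked while iterating, so a wrong number surfaces as an error at run time rather than being silently accepted.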

Answer 5

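UPDATE:

Use tf.data.experimental.cardinality(dataset).

For TensorFlow Datasets, you can use _, info = tfds.load(with_info=True). Then you can call info.splits['train'].num_examples. But even in this case it does not work properly if you define your own split.

So you can either count your files or iterate over the dataset (as described in other answers):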

num_training_examples = 0
num_validation_examples = 0

for example in training_set:
    num_training_examples += 1

for example in validation_set:
    num_validation_examples += 1

Answer 6

For some datasets, such as COCO, the cardinality function does not return the size. One quick way to compute the size of a dataset is to use map and reduce, like so:

ds.map(lambda x: 1, num_parallel_calls=tf.data.experimental.AUTOTUNE).reduce(tf.constant(0), lambda x,_: x+1)

Answer 7

A bit late to the party, but for a large dataset stored as TFRecords I used this (TF 1.15):

import tensorflow as tf
tf.compat.v1.enable_eager_execution()
dataset = tf.data.TFRecordDataset('some_path')
# Count the records batch by batch
n = 0
take_n = 200000
for samples in dataset.batch(take_n):
  n += int(samples.shape[0])  # the last batch may hold fewer than take_n records
  print(n)

Answer 8

In TF 2.0, I do it like this:

num = 0  # stays 0 if the dataset is empty
for num, _ in enumerate(dataset, start=1):
    pass

print(f'Number of elements: {num}')

Answer 9

You can use this for TFRecords in TF2:

ds = tf.data.TFRecordDataset(dataset_filenames)
ds_size = sum(1 for _ in ds)

Answer 10

As of TensorFlow 2.3 and later, you can use:

dataset.cardinality().numpy()

Note that the .cardinality() method has been integrated into the main package (it was previously in the experimental package).

Also note that after a filter() operation has been applied, this call can return -2 (tf.data.UNKNOWN_CARDINALITY).
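One way to handle the -2 case is to fall back to a full pass over the data only when the cardinality is unknown. This is a minimal sketch, assuming TF 2.3 or later with eager execution. The helper name count_elements is hypothetical and not part of TensorFlow.

```python
import tensorflow as tf


def count_elements(dataset):
    """Return the number of elements, iterating only if tf.data cannot tell."""
    n = dataset.cardinality().numpy()
    if n == tf.data.INFINITE_CARDINALITY:
        raise ValueError("dataset is infinite (e.g. repeat() without a count)")
    if n == tf.data.UNKNOWN_CARDINALITY:
        # Fall back to a full pass over the data; this can be slow.
        n = dataset.reduce(tf.constant(0, tf.int64),
                           lambda acc, _: acc + 1).numpy()
    return int(n)


filtered = tf.data.Dataset.range(10).filter(lambda x: x < 3)
print(filtered.cardinality().numpy())  # -2: unknown after filter()
print(count_elements(filtered))        # 3: obtained by iterating
```

The fallback reads the whole dataset once, so for large or expensive pipelines it may be worth caching the result.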

Answer 11

This has worked for me:

length_dataset = dataset.reduce(0, lambda x, _: x + 1).numpy()

It iterates over your dataset and increments the variable x, which is returned as the length of the dataset.

Answer 12

Let's say you want to find the number of examples in the training split of the oxford_iiit_pet dataset:

ds, info = tfds.load('oxford_iiit_pet', split='train', shuffle_files=True, as_supervised=True, with_info=True)

print(info.splits['train'].num_examples)

Answer 13

You can use len(filename_dataset) in TensorFlow 2.4.0.

Answer 14

As of version 2.5.0, you can simply call print(dataset.cardinality()) to see the length and type of the dataset.

Answer 15

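I am very surprised that this problem has no explicit solution, since it is such a simple feature. When I iterate over the dataset through TQDM, I find that TQDM knows the data size. How does this work?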

for x in tqdm(ds['train']):
  pass  # do something with x

-> 1%|          | 15643/1281167 [00:16<07:06, 2964.90it/s]

t=tqdm(ds['train'])
t.total
-> 1281167

Answer 16

I have seen many ways of getting the number of samples, but you can actually do it quite easily in Keras:

len(dataset) * BATCH_SIZE
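Keep in mind that this product counts a partial last batch as a full one, so it is an upper bound rather than an exact count. The following minimal sketch shows the difference, assuming TF 2.3 or later with eager execution and a batched dataset whose elements are single tensors. The range dataset and BATCH_SIZE value are illustrative only.

```python
import tensorflow as tf

BATCH_SIZE = 4
dataset = tf.data.Dataset.range(10).batch(BATCH_SIZE)

# Counts the partial last batch as if it were full.
upper_bound = len(dataset) * BATCH_SIZE

# Exact count: sum the actual size of every batch.
exact = dataset.reduce(
    tf.constant(0, tf.int64),
    lambda acc, batch: acc + tf.shape(batch, out_type=tf.int64)[0],
).numpy()

print(upper_bound, exact)  # 12 10
```

If the dataset was batched with drop_remainder=True, or its size is a multiple of BATCH_SIZE, the two numbers agree.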

Answer 17

In TensorFlow 2.6.0 (I am not sure whether this was possible in earlier versions):

https://www.tensorflow.org/api_docs/python/tf/data/Dataset#__len__

Dataset.__len__()

Answer 18

For early TensorFlow versions (2.1 or higher):

sum(dataset.map(lambda x: 1).as_numpy_iterator())

This way you don't have to load every object in your dataset into memory; instead, each element is mapped to 1 and the ones are summed up.

python

python-3.x

tensorflow

tensorflow-datasets