TensorFlow: Count the number of examples in a TFRecord file without using the deprecated `tf.python_io.tf_record_iterator`

Question

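Please read the post before marking it as a duplicate:

I have been looking for an efficient way to count the number of examples in a TFRecord file of images. Because a TFRecord file does not store any metadata about itself, the user has to iterate through the file to compute this count.

There are several questions on Stack Overflow that address this. The problem is that they all seem to use the deprecated tf.python_io.tf_record_iterator command, so they are not a stable solution. Here is an example of an existing post:

So I would like to know whether there is a way to count the number of records using the new Dataset API.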

Answer 1

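I got the following code working without the deprecated command. Hopefully it will help others.

Using the Dataset API, I set up an iterator and then loop over it. I am not sure whether this is the fastest approach, but it does work. Make sure the batch size and the repeat count are both set to 1; otherwise the loop counts batches (or repeated passes) instead of the individual examples in the dataset.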

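import tensorflow as tf

count_test = tf.data.TFRecordDataset('testing.tfrecord')
# Parsing is optional when only counting; _parse_image_function is assumed to be defined elsewhere.
count_test = count_test.map(_parse_image_function)
count_test = count_test.repeat(1)  # a single pass over the data
count_test = count_test.batch(1)   # one example per batch, so batches == examples
# iter() replaces the deprecated make_one_shot_iterator(); it needs eager
# execution, which is the default in TensorFlow 2.x.
test_counter = iter(count_test)

c = 0
for ex in test_counter:
    c += 1
print(f"There are {c} testing records")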

This seems to work well even on relatively large files.
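
To illustrate the batch-size caveat, here is a minimal sketch, assuming TensorFlow 2.x with eager execution. The temporary file, the dummy payloads, and the helper name count_by_iteration are made up for this example and are not part of the original answer.

```python
import os
import tempfile

import tensorflow as tf


def count_by_iteration(dataset):
    """Count the elements a tf.data.Dataset yields (requires eager execution)."""
    count = 0
    for _ in dataset:
        count += 1
    return count


# Write 10 dummy records to a temporary file (path and payloads are illustrative).
path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for i in range(10):
        writer.write(f"record-{i}".encode("utf-8"))

unbatched = tf.data.TFRecordDataset(path)
batched = unbatched.batch(4)  # groups of 4, 4 and 2 records

n_examples = count_by_iteration(unbatched)  # counts individual records
n_batches = count_by_iteration(batched)     # counts batches, not records
print(n_examples, n_batches)
```

With 10 records and a batch size of 4, the unbatched loop sees 10 elements while the batched loop sees only 3, which is why the batch size has to stay at 1 (or batching has to be left out) when counting examples.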

Answer 2

The Dataset class lists a reduce method. The documentation gives the following example of using it to count records:

import numpy as np
import tensorflow as tf

# generate the dataset (batch size and repeat must be 1; maybe avoid dataset manipulation like map and shard)
ds = tf.data.Dataset.range(5)
# count the examples with reduce
cnt = ds.reduce(np.int64(0), lambda x, _: x + 1)

## produces 5

I do not know whether this method is faster than @krishnab's for loop.
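
The same reduce call should apply to a TFRecordDataset as well. The following is a rough sketch, assuming TensorFlow 2.x with eager execution; the temporary file and its seven dummy records exist only for the demonstration.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Write 7 dummy records to a temporary file (path and payload are illustrative).
path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for _ in range(7):
        writer.write(b"payload")

ds = tf.data.TFRecordDataset(path)
# Start from an int64 zero and add 1 per record; the record itself is ignored.
cnt = ds.reduce(np.int64(0), lambda x, _: x + 1)
n_records = int(cnt.numpy())  # convert the scalar tensor to a Python int
print(n_records)
```

Because the reduction runs inside the tf.data pipeline rather than in a Python loop, it may avoid some per-element Python overhead, but that has not been measured here.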

Answer 3

With TensorFlow 2.1 (using code found in a linked post), the following worked for me:

import os

import tensorflow as tf


def count_tfrecord_examples(
        tfrecords_dir: str,
) -> int:
    """
    Counts the total number of examples in a collection of TFRecord files.

    :param tfrecords_dir: directory that is assumed to contain only TFRecord files
    :return: the total number of examples in the collection of TFRecord files
        found in the specified directory
    """

    count = 0
    for file_name in os.listdir(tfrecords_dir):
        tfrecord_path = os.path.join(tfrecords_dir, file_name)
        count += sum(1 for _ in tf.data.TFRecordDataset(tfrecord_path))

    return count
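
The function above assumes the directory contains nothing but TFRecord files. If that cannot be guaranteed, one possible variant is to select files by a glob pattern and pass the whole list to a single TFRecordDataset. This is only a sketch: the helper name count_tfrecord_examples_glob, the *.tfrecord naming convention, and the demo files are assumptions made for illustration.

```python
import glob
import os
import tempfile

import tensorflow as tf


def count_tfrecord_examples_glob(pattern: str) -> int:
    """Count records across all files matching a glob pattern (hypothetical helper)."""
    files = sorted(glob.glob(pattern))
    if not files:
        return 0  # nothing matched, so there is nothing to count
    # TFRecordDataset accepts a list of file names and reads them in order.
    return sum(1 for _ in tf.data.TFRecordDataset(files))


# Illustrative setup: two TFRecord files plus an unrelated file in one directory.
demo_dir = tempfile.mkdtemp()
for name, n in (("a.tfrecord", 3), ("b.tfrecord", 5)):
    with tf.io.TFRecordWriter(os.path.join(demo_dir, name)) as writer:
        for _ in range(n):
            writer.write(b"payload")
with open(os.path.join(demo_dir, "notes.txt"), "w") as f:
    f.write("not a TFRecord file")

total = count_tfrecord_examples_glob(os.path.join(demo_dir, "*.tfrecord"))
print(total)
```

Here notes.txt is skipped because it does not match the pattern, so only the records in the two TFRecord files are counted.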
