Optimizing shuffle buffer size in tensorflow dataset api

Question

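I am trying to use the dataset api to load data, and I find that most of my time is spent loading data into the shuffle buffer. How can I optimize this pipeline to minimize the time spent filling the shuffle buffer?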

(tf.data.Dataset.list_files(path)
   .shuffle(num_files)  # number of tfrecord files 
   .apply(tf.contrib.data.parallel_interleave(lambda f: tf.data.TFRecordDataset(f), cycle_length=num_files))
   .shuffle(num_items)  # number of images in the dataset
   .map(parse_func, num_parallel_calls=8)
   .map(get_patches, num_parallel_calls=8)
   .apply(tf.contrib.data.unbatch())
   # Patch buffer is currently the number of patches extracted per image
   .apply(tf.contrib.data.shuffle_and_repeat(patch_buffer))
   .batch(64)
   .prefetch(1)
   .make_one_shot_iterator())
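
For context on why the fill time dominates: `shuffle(buffer_size)` fills its buffer before it emits the first element, and the `.shuffle(num_items)` stage above sits directly after the TFRecord reader, so every record in the dataset is read before the first batch comes out. The sketch below is a simplified pure-Python model of that behavior, not TensorFlow's implementation; `shuffle_buffer`, `counting_source` and `reads_before_first_output` are hypothetical names used only for illustration.

```python
import random
from itertools import islice


def shuffle_buffer(source, buffer_size, seed=0):
    """Simplified model of a shuffle buffer: fill it, then emit a random slot
    and refill that slot from the source until the source runs out."""
    rng = random.Random(seed)
    it = iter(source)
    buffer = list(islice(it, buffer_size))  # fill phase: nothing is emitted yet
    while buffer:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        try:
            buffer[idx] = next(it)  # replace the emitted element
        except StopIteration:
            buffer.pop(idx)  # source exhausted: drain what is left


def counting_source(n, counter):
    """Yield 0..n-1 and count how many elements have been read so far."""
    for i in range(n):
        counter["reads"] += 1
        yield i


def reads_before_first_output(num_items, buffer_size):
    """Number of source reads needed before the first shuffled element appears."""
    counter = {"reads": 0}
    stream = shuffle_buffer(counting_source(num_items, counter), buffer_size)
    next(stream)
    return counter["reads"]
```

In this model a buffer as large as the dataset forces a full pass over the data before anything is produced, while a small buffer starts almost immediately at the cost of a weaker shuffle.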

Answer 1

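Since I have at most a few thousand images, my solution to this problem was to create a separate tfrecord file for each image. That way individual images can be shuffled without having to load them into memory first. This drastically reduced the amount of buffering that needed to occur.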
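
A minimal sketch of this idea, assuming TensorFlow 2.4 or later (where `tf.contrib` no longer exists and `tf.data.AUTOTUNE` is available). It is not the exact code used above: the helper names, the `"image"` feature key, the `image_XXXXX.tfrecord` naming scheme and the `cycle_length` value are all assumptions made for illustration. With one record per file, shuffling the file names shuffles the images, and only short path strings are held while shuffling.

```python
import os

import tensorflow as tf


def write_one_record_per_image(images, out_dir):
    """Write each encoded image (bytes) to its own TFRecord file.
    The "image" feature key and the file naming scheme are assumptions."""
    paths = []
    for i, img_bytes in enumerate(images):
        path = os.path.join(out_dir, f"image_{i:05d}.tfrecord")
        feature = {
            "image": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[img_bytes]))
        }
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        with tf.io.TFRecordWriter(path) as writer:
            writer.write(example.SerializeToString())
        paths.append(path)
    return paths


def parse_example(serialized):
    """Return the raw image bytes stored under the assumed "image" key."""
    spec = {"image": tf.io.FixedLenFeature([], tf.string)}
    return tf.io.parse_single_example(serialized, spec)["image"]


def build_dataset(file_pattern, batch_size=64, cycle_length=8):
    """Shuffle at the file-name level, then read and parse records lazily."""
    # shuffle=True shuffles the matched file names, not the decoded images.
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    records = files.interleave(
        tf.data.TFRecordDataset,
        cycle_length=cycle_length,
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    parsed = records.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    return parsed.batch(batch_size).prefetch(1)
```

The patch extraction, unbatching and per-patch shuffle from the question would follow the `map` step as before; only the large `.shuffle(num_items)` stage is replaced by the file-name shuffle.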

python

tensorflow

tensorflow-datasets