TensorFlow: computing an aggregate function over lots of data in tfrecord format

Question

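I have a very large dataset, stored in chunks of 5000 across several tfrecord files. All of these records together are much larger than my RAM. What I'd like to do is sample N = 0.05 * TOTAL_SIZE random indexes from the dataset and compute the mean and standard deviation so that I can normalize my data.

If it weren't for the size of the dataset this would be easy, but I run out of memory even when I just try to compute the sum of all the tensors I'm interested in.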

import numpy as np
import tensorflow as tf

# NOTE: count is computed ahead of time by looping over all the tfrecord entries

with tf.device('/cpu:0'):
    sample_size = int(count * 0.05)
    # randint draws with replacement, so the set may hold fewer than sample_size indexes
    random_indexes = set(np.random.randint(low=0, high=count, size=sample_size))
    stat_graph = tf.Graph()
    with tf.Session(graph=stat_graph) as sess:
        val_sum = np.zeros(shape=(180, 2050))
        index = 0  # running record position across all files
        for file in files:
            print("Reading from file: %s" % file)
            for record in tf.python_io.tf_record_iterator(file):
                features = tf.parse_single_example(
                    record,
                    features={
                        "val": tf.FixedLenFeature((180, 2050), tf.float32),
                    })
                if index in random_indexes:
                    val_sum += features["val"].eval(session=sess)
                index += 1
        # divide by the number of distinct records that were actually summed
        val_mean = val_sum / len(random_indexes)

What is the proper way to compute an aggregate function (i.e. the mean and/or standard deviation) over a tfrecord dataset?

Answer 1

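I think tf.parse_single_example adds new ops to the graph every time it is called, so the graph keeps growing with each record, which would explain running out of memory. Instead of the above, you should build the parsing op once and feed each serialized string through a placeholder: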

...
record_placeholder = tf.placeholder(tf.string)
features = tf.parse_single_example(
    record_placeholder,
    features={
        "val": tf.FixedLenFeature((180, 2050), tf.float32),
    })
for record in tf.python_io.tf_record_iterator(file):
...
val_sum += features["val"].eval(feed_dict={record_placeholder: record}, session=sess)
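With the elided parts filled in, the pattern above might look like the following sketch. It is only an illustration under some assumptions: it uses the tf.compat.v1 aliases so it can run on TensorFlow 2.x, the helper names write_demo_file and sampled_mean are made up for this example, and graph.finalize() is added purely as a safety check so that any accidental op creation inside the loop raises an error instead of silently growing the graph.

```python
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1  # TF1-style graph/session API, also available in TF 2.x


def write_demo_file(path, arrays):
    """Hypothetical helper: write each array as one Example with a float feature "val"."""
    with tf.io.TFRecordWriter(path) as writer:
        for arr in arrays:
            feature = tf.train.Feature(
                float_list=tf.train.FloatList(value=arr.ravel().tolist()))
            example = tf.train.Example(
                features=tf.train.Features(feature={"val": feature}))
            writer.write(example.SerializeToString())


def sampled_mean(files, keep_indexes, shape=(180, 2050)):
    """Mean of "val" over the records whose running index is in keep_indexes."""
    graph = tf.Graph()
    with graph.as_default():
        # Build the parsing op exactly once, outside the record loop.
        record_placeholder = tf1.placeholder(tf.string, shape=[])
        features = tf1.parse_single_example(
            record_placeholder,
            features={"val": tf1.FixedLenFeature(shape, tf.float32)})
        val = features["val"]
    graph.finalize()  # from here on, adding ops to the graph raises an error

    val_sum = np.zeros(shape, dtype=np.float64)
    used = 0   # number of records actually summed
    index = 0  # running record position across all files
    with tf1.Session(graph=graph) as sess:
        for path in files:
            for record in tf1.io.tf_record_iterator(path):
                if index in keep_indexes:
                    # Only the serialized bytes change between calls.
                    val_sum += sess.run(val, feed_dict={record_placeholder: record})
                    used += 1
                index += 1
    if used == 0:
        raise ValueError("no record matched keep_indexes")
    return val_sum / used
```

Calling sampled_mean(files, random_indexes) with the names from the question should then keep memory use flat, since only one record and one accumulator array are held at a time.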

Let me know whether this works, as I haven't been able to test it.
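The question also asks about the standard deviation, which the snippet above does not cover. One possible way to get it in the same single pass, without holding more than one record in memory, is a running accumulator such as Welford's algorithm. The sketch below is an assumption about how that could be done, not something taken from the original post; the class name RunningMoments is made up for illustration, and it works on plain NumPy arrays, so it is independent of how the records are read.

```python
import numpy as np


class RunningMoments:
    """Element-wise running mean and variance using Welford's algorithm."""

    def __init__(self, shape):
        self.count = 0
        self.mean = np.zeros(shape, dtype=np.float64)
        # Running sum of squared deviations from the current mean.
        self.m2 = np.zeros(shape, dtype=np.float64)

    def update(self, value):
        """Fold one array of the given shape into the running statistics."""
        value = np.asarray(value, dtype=np.float64)
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def std(self, ddof=0):
        """Element-wise standard deviation; ddof=1 gives the sample estimate."""
        if self.count - ddof <= 0:
            raise ValueError("not enough samples")
        return np.sqrt(self.m2 / (self.count - ddof))
```

In the loop from the answer, the line that adds to val_sum would be replaced by a call to update() with the evaluated "val" array, and mean and std() would be read off once all files have been processed.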

python

tensorflow

tfrecord