Using feed_dict is more than 5x faster than using dataset API?

Question

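I created a dataset in TFRecord format for testing. Each entry contains 200 columns, named C1 - C199, each being a list of strings, plus a label column that holds the labels. The code that creates the data can be found here: https://github.com/codescv/tf-dist/blob/8bb3c44f55939fc66b3727a730c57887113e899c/src/gen_data.py#L25

Then I used a linear model to train on the data. The first approach looks like this: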

dataset = tf.data.TFRecordDataset(data_file)
dataset = dataset.prefetch(buffer_size=batch_size*10)
dataset = dataset.map(parse_tfrecord, num_parallel_calls=5)
dataset = dataset.repeat(num_epochs)
dataset = dataset.batch(batch_size)

features, labels = dataset.make_one_shot_iterator().get_next()    
logits = tf.feature_column.linear_model(features=features, feature_columns=columns, cols_to_vars=cols_to_vars)
train_op = ...

with tf.Session() as sess:
    sess.run(train_op)

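The full code can be found here: https://github.com/codescv/tf-dist/blob/master/src/lr_single.py

When I run the code above, I get 0.85 steps/sec (with a batch size of 1024).

In the second approach, I manually fetch batches from the dataset into Python and then feed them to a placeholder, like this: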

example = tf.placeholder(dtype=tf.string, shape=[None])
features = tf.parse_example(example, features=tf.feature_column.make_parse_example_spec(columns+[tf.feature_column.numeric_column('label', dtype=tf.float32, default_value=0)]))
labels = features.pop('label')
train_op = ...

dataset = tf.data.TFRecordDataset(data_file).repeat().batch(batch_size)
next_batch = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    data_batch = sess.run(next_batch)
    sess.run(train_op, feed_dict={example: data_batch})

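The full code can be found here: https://github.com/codescv/tf-dist/blob/master/src/lr_single_feed.py

When I run the code above, I get 5 steps/sec. That is more than 5x faster than the first approach. This is what I don't understand, because in theory the second approach should be slower due to the extra serialization/deserialization of the data batches.

Thanks!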

Answer 1

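Currently (as of TensorFlow 1.9), there is a performance issue when using tf.data to map and batch tensors that have a large number of features with a small amount of data in each. The issue has two causes: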

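1. The dataset.map(parse_tfrecord, ...) transformation executes O(batch_size * num_columns) small operations to create a batch. By contrast, feeding a tf.placeholder() to tf.parse_example() executes O(1) operations to create the same batch.
2. Batching many tf.SparseTensor objects using dataset.batch() is much slower than directly creating the same tf.SparseTensor as the output of tf.parse_example().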
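
To get a feel for the size of the first effect, here is a rough back-of-the-envelope count for the setup in the question. It is only a sketch: it assumes one unit of work per (record, column) pair, and the real constant factor inside TensorFlow is not something this answer pins down. The function names are made up for illustration.

```python
# Rough count of the small operations needed to assemble one batch.
# Assumption: one unit of work per (record, column) pair; the true
# constant factor inside TensorFlow is unknown and left out.

def ops_per_batch_map_then_batch(batch_size, num_columns):
    """map() handles each record separately, so work scales with both sizes."""
    return batch_size * num_columns

def ops_per_batch_single_parse():
    """A single tf.parse_example() call handles the whole vector of records."""
    return 1

batch_size = 1024       # batch size from the question
num_columns = 200 + 1   # 200 feature columns plus the label column

per_record = ops_per_batch_map_then_batch(batch_size, num_columns)
batched = ops_per_batch_single_parse()
print(per_record, batched)  # 205824 1
```

Even if each of those operations is tiny, a couple of hundred thousand of them per step can plausibly outweigh the cost of one extra round trip through Python.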

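Improvements to both of these issues are underway and should be available in a future version of TensorFlow. In the meantime, you can improve the performance of the tf.data-based pipeline by switching the order of dataset.map() and dataset.batch(), and rewriting the dataset.map() to work on a vector of strings, like the feeding-based version: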

dataset = tf.data.TFRecordDataset(data_file)
dataset = dataset.prefetch(buffer_size=batch_size*10)
dataset = dataset.repeat(num_epochs)

# Batch first to create a vector of strings as input to the map(). 
dataset = dataset.batch(batch_size)

def parse_tfrecord_batch(record_batch):
  features = tf.parse_example(
      record_batch,
      features=tf.feature_column.make_parse_example_spec(
          columns + [
              tf.feature_column.numeric_column(
                  'label', dtype=tf.float32, default_value=0)]))
  labels = features.pop('label')
  return features, labels

# NOTE: Parallelism might not be as useful, because the individual map function now does
# more work per invocation, but you might want to experiment with this.
dataset = dataset.map(parse_tfrecord_batch)

# Add a prefetch at the end to pipeline execution.
dataset = dataset.prefetch(1)

features, labels = dataset.make_one_shot_iterator().get_next()    
# ...

EDIT (2018/6/18): To answer your questions from the comments:

Why is dataset.map(parse_tfrecord, ...) O(batch_size * num_columns), not O(batch_size)? If parsing requires enumeration of the columns, why doesn't parse_example take O(num_columns)?

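When you wrap TensorFlow code in a Dataset.map() (or another functional transformation), a constant number of extra operations per output are added to "return" values from the function and (in the case of tf.SparseTensor values) to "convert" them to a standard format. When you pass the outputs of tf.parse_example() directly to the input of your model, these operations aren't added. While they are very small operations, executing so many of them can become a bottleneck. (Technically, the parsing does take O(batch_size * num_columns) time, but the constants involved in parsing are much smaller than those involved in executing an operation.)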

Why do you add a prefetch at the end of the pipeline?

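When you're interested in performance, this is almost always the best thing to do, and it should improve the overall performance of your pipeline. For more information about best practices, see the performance guide for tf.data.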
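
As a rough illustration of why a trailing prefetch helps, here is an idealized timing model in plain Python. It is a sketch under stated assumptions, not a measurement: the per-batch times are hypothetical, and it assumes that producing the next batch can overlap perfectly with the training step on the current one.

```python
# Idealized timing model for a training loop, in milliseconds.
# The input numbers below are hypothetical, chosen only for illustration.

def total_ms_without_prefetch(steps, produce_ms, train_ms):
    """Each step waits for its batch to be produced, then trains on it."""
    return steps * (produce_ms + train_ms)

def total_ms_with_prefetch(steps, produce_ms, train_ms):
    """The next batch is produced while the current step trains.

    The first batch cannot overlap with anything, and the last training
    step has nothing left to overlap with; in between, each step costs
    whichever of the two stages is slower.
    """
    return produce_ms + (steps - 1) * max(produce_ms, train_ms) + train_ms

steps = 100
produce_ms = 100   # hypothetical time to produce one parsed batch
train_ms = 100     # hypothetical time for one training step

print(total_ms_without_prefetch(steps, produce_ms, train_ms))  # 20000
print(total_ms_with_prefetch(steps, produce_ms, train_ms))     # 10100
```

In this model the gain is largest when the two stages take about the same time, and it shrinks as one stage comes to dominate the other.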

Tags: tensorflow, tensorflow-datasets