Difference between tf.data.Dataset.map() and tf.data.Dataset.apply()

Question

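With the recent upgrade to version 1.4, TensorFlow included tf.data in the library core. One "major new feature" described in the version 1.4 release notes is tf.data.Dataset.apply(), which is a "method for applying custom transformation functions". How is this different from the already existing tf.data.Dataset.map()?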

Answer 1

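The difference is that map executes a function on each element of the Dataset separately, whereas apply executes a function on the whole Dataset at once (for example group_by_window, which the documentation gives as an example).

The argument of apply is a function that takes a Dataset and returns a Dataset, whereas the argument of map is a function that takes one element and returns one transformed element.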
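To make the two signatures concrete, here is a minimal pure-Python sketch. ToyDataset and drop_first are hypothetical names written for this illustration; ToyDataset is not the TensorFlow class and only mimics the calling conventions described above: map hands the function one element at a time, apply hands it the whole dataset.

```python
class ToyDataset:
    """Hypothetical stand-in for tf.data.Dataset; not the real class."""

    def __init__(self, elements):
        self.elements = list(elements)

    def map(self, map_func):
        # map_func sees ONE element and returns one transformed element.
        return ToyDataset(map_func(e) for e in self.elements)

    def apply(self, transformation_func):
        # transformation_func sees the WHOLE dataset and returns a dataset.
        return transformation_func(self)


def drop_first(dataset):
    # A dataset-level transformation: it operates on the sequence as a
    # whole rather than on a single element.
    return ToyDataset(dataset.elements[1:])


ds = ToyDataset([1, 2, 3, 4])
doubled = ds.map(lambda x: x * 2)   # element -> element
trimmed = ds.apply(drop_first)      # dataset -> dataset

print(doubled.elements)  # [2, 4, 6, 8]
print(trimmed.elements)  # [2, 3, 4]
```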

Answer 2

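The answer above is absolutely correct. You might still be wondering why we introduced Dataset.apply(), so I thought I'd offer some background.

The tf.data API has a set of core transformations, like Dataset.map() and Dataset.filter(), that are generally useful across a wide range of datasets, are unlikely to change, and are implemented as methods on the tf.data.Dataset object. In particular, they are subject to the same backwards compatibility guarantees as other core APIs in TensorFlow.

However, the core approach is a little restrictive. We also want the freedom to experiment with new transformations before adding them to the core, and to allow other library developers to create their own reusable transformations. Therefore, in TensorFlow 1.4 we split out a set of custom transformations that live in tf.contrib.data. The custom transformations include some that have very specific functionality (like tf.contrib.data.sloppy_interleave()), and some where the API is still in flux (like tf.contrib.data.group_by_window()). Originally we implemented these custom transformations as functions from Dataset to Dataset, which had an unfortunate effect on the syntactic flow of a pipeline. For example: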

dataset = tf.data.TFRecordDataset(...).map(...)

# Method chaining breaks when we apply a custom transformation.
dataset = custom_transformation(dataset, x, y, z)

dataset = dataset.shuffle(...).repeat(...).batch(...)

Since this seemed to be a common pattern, we added Dataset.apply() as a way to chain core and custom transformations in a single pipeline:

dataset = (tf.data.TFRecordDataset(...)
           .map(...)
           .apply(custom_transformation(x, y, z))
           .shuffle(...)
           .repeat(...)
           .batch(...))

It's a minor feature in the grand scheme of things, but hopefully it helps to make tf.data programs easier to read, and the library easier to extend.
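The chained form works because custom_transformation(x, y, z) is called first and returns a function from dataset to dataset, which apply() then invokes. A minimal pure-Python sketch of that pattern follows. ToyDataset and take_every are hypothetical names invented for this illustration and are not part of TensorFlow; the sketch only shows how a parameterized transformation can slot into a method chain.

```python
class ToyDataset:
    """Hypothetical stand-in for tf.data.Dataset; not the real class."""

    def __init__(self, elements):
        self.elements = list(elements)

    def map(self, map_func):
        return ToyDataset(map_func(e) for e in self.elements)

    def batch(self, batch_size):
        e = self.elements
        return ToyDataset(e[i:i + batch_size]
                          for i in range(0, len(e), batch_size))

    def apply(self, transformation_func):
        return transformation_func(self)


def take_every(n):
    """Hypothetical custom transformation: keep every n-th element."""
    def _apply_fn(dataset):
        # The parameter n is captured by the closure; apply() supplies
        # the dataset later.
        return ToyDataset(dataset.elements[::n])
    return _apply_fn


result = (ToyDataset(range(10))
          .map(lambda x: x + 1)      # 1, 2, ..., 10
          .apply(take_every(3))      # 1, 4, 7, 10
          .batch(2))                 # [1, 4], [7, 10]
print(result.elements)
```

Returning a closure keeps the transformation's own parameters separate from the dataset argument, so the same take_every(3) value could be reused across several pipelines.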

Answer 3

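I don't have enough reputation to comment, but I just wanted to point out that you can actually use map to operate on multiple elements of a dataset, contrary to @sunreef's comments on his own post.

According to the documentation, map takes as an argument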

map_func: A function mapping a nested structure of tensors (having shapes and types defined by self.output_shapes and self.output_types) to another nested structure of tensors.

output_shapes is defined by the dataset and can be modified with API functions such as batch. So, for example, you can do batch normalization using only dataset.batch and .map:

dataset = dataset ...
# Dataset methods return a new dataset, so reassign the result.
dataset = dataset.batch(batch_size)
dataset = dataset.map(normalize_fn)

It seems the primary use of apply() is when you really want to do a transformation across the entire dataset.
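As a rough illustration of the batch-then-map point, here is a pure-Python sketch. ToyDataset is a hypothetical stand-in, not the TensorFlow class, and this normalize_fn does simple mean-centering as a simplified placeholder for real batch normalization. The point it shows is that once batch() has run, the function passed to map() receives a whole batch rather than a single example.

```python
class ToyDataset:
    """Hypothetical stand-in for tf.data.Dataset; not the real class."""

    def __init__(self, elements):
        self.elements = list(elements)

    def batch(self, batch_size):
        e = self.elements
        return ToyDataset(e[i:i + batch_size]
                          for i in range(0, len(e), batch_size))

    def map(self, map_func):
        return ToyDataset(map_func(e) for e in self.elements)


def normalize_fn(batch):
    # Receives a whole batch, because batching happened before map().
    mean = sum(batch) / len(batch)
    return [x - mean for x in batch]


dataset = ToyDataset([1.0, 3.0, 10.0, 20.0])
dataset = dataset.batch(2)           # elements: [1.0, 3.0] and [10.0, 20.0]
dataset = dataset.map(normalize_fn)  # each batch centered on its own mean
print(dataset.elements)  # [[-1.0, 1.0], [-5.0, 5.0]]
```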

Answer 4

Simply put, the argument of apply()'s transformation_func is a Dataset; the argument of map()'s map_func is an element.

