When to use the tensorflow Dataset API versus pandas or numpy

Question

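I have seen many guides on using LSTMs for time series in tensorflow, but I am still unsure about the current best practices for reading and processing data - in particular, when one is supposed to use the tf.data.Dataset API.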

In my case I have a single file data.csv containing my features, and I would like to perform the following two tasks:

Compute targets - the target at time t is the relative change of some column at some horizon, i.e.
```
labels[i] = features[i + h, -1] / features[i, -1] - 1
```
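I would like h to be a parameter here, so that I can experiment with different horizons.
Get rolling windows - for training purposes, I need to roll my features into windows of length window: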
```
train_features[i] = features[i: i + window]
```
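
For concreteness, the two definitions above could be sketched in plain numpy roughly as follows. The helper names make_labels and make_windows and the toy features array are hypothetical, and the sketch assumes h >= 1 and that the target column is the last one, as in the formulas.

```python
import numpy as np

def make_labels(features, h):
    """Relative change of the last column over a horizon of h rows (assumes h >= 1)."""
    last = features[:, -1]
    # labels[i] = features[i + h, -1] / features[i, -1] - 1, defined for i < n - h
    return last[h:] / last[:-h] - 1

def make_windows(features, window):
    """Stack every slice features[i: i + window] into one array."""
    n = features.shape[0]
    # Row i of idx holds the row indices i, i + 1, ..., i + window - 1.
    idx = np.arange(n - window + 1)[:, None] + np.arange(window)[None, :]
    return features[idx]  # shape: (n - window + 1, window, n_columns)

# Toy stand-in for the contents of data.csv: 10 rows, 2 columns.
features = np.arange(1, 21, dtype=np.float64).reshape(10, 2)
labels = make_labels(features, h=2)
train_features = make_windows(features, window=3)
```

The two results generally have different lengths (n - h versus n - window + 1), so they still have to be trimmed to a common index range before training.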

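I am perfectly comfortable constructing these objects with pandas or numpy, so I am not asking how to achieve this in general - my question is specifically what such a pipeline should look like in tensorflow.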

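Edit: I guess I would also like to know whether the 2 tasks I have listed are suited to the Dataset API, or whether I am better off using other libraries to deal with them?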

Answer 1

First off, note that you can use the Dataset API with pandas or numpy arrays, as described in the tutorial:

If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices()
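
As a minimal sketch of what the quoted passage describes, assuming TensorFlow 2.x with eager execution and made-up array shapes (the random features and labels arrays are placeholders, not the asker's data):

```python
import numpy as np
import tensorflow as tf

# Placeholder in-memory data: 100 samples with 4 features each, one label per sample.
features = np.random.rand(100, 4).astype(np.float32)
labels = np.random.rand(100).astype(np.float32)

# Each dataset element is one (feature_row, label) pair sliced along axis 0.
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=100).batch(32)

# Pull a single batch to inspect its shapes.
for batch_x, batch_y in dataset.take(1):
    first_shapes = (tuple(batch_x.shape), tuple(batch_y.shape))
```

Arrays built with pandas can be passed the same way after converting them with DataFrame.to_numpy().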

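A more interesting question is whether you should organize the data pipeline with a session feed_dict or via Dataset methods. As already stated in the comments, the Dataset API is more efficient, because the data flows directly to the device, bypassing the client. From the "Performance Guide":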

While feeding data using a feed_dict offers a high level of flexibility, in most instances using feed_dict does not scale optimally. However, in instances where only a single GPU is being used the difference can be negligible. Using the Dataset API is still strongly recommended. Try to avoid the following:
```
# feed_dict often results in suboptimal performance when using large inputs
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
```

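But, as they say themselves, the difference may be negligible, and the GPU can still be fully utilized with ordinary feed_dict input. When training speed is not critical, there is no difference: use whichever pipeline you feel comfortable with. When speed is important and you have a large training set, the Dataset API seems the better choice, especially if you plan to distribute the computation.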

The Dataset API works nicely with text data such as CSV files; check out this section of the dataset tutorial.
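
A rough sketch of reading a CSV file and producing rolling windows entirely inside tf.data is given below, assuming TensorFlow 2.x. The file contents, the column names, and the window value are invented for illustration; the window-then-batch idiom is one possible way to get rolling windows, not something the tutorial prescribes.

```python
import os
import tempfile
import tensorflow as tf

# Write a tiny stand-in for data.csv: a header plus 6 rows of 2 numeric columns.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w") as f:
    f.write("volume,price\n")
    for i in range(6):
        f.write(f"{i},{100 + i}\n")

window = 3  # hypothetical window length

def parse_line(line):
    # decode_csv returns one scalar tensor per column; stack them into one row.
    cols = tf.io.decode_csv(line, record_defaults=[[0.0], [0.0]])
    return tf.stack(cols)

# Skip the header line and parse every remaining line into a float32 row.
rows = tf.data.TextLineDataset(path).skip(1).map(parse_line)

# window() yields nested datasets; batching each one turns it into a single tensor.
windows = rows.window(window, shift=1, drop_remainder=True)
windows = windows.flat_map(lambda w: w.batch(window, drop_remainder=True))

result = [w.numpy() for w in windows]  # 4 arrays of shape (3, 2)
```

Computing the horizon-based labels could be done in a similar fashion, or up front in numpy before the arrays are handed to Dataset.from_tensor_slices().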

csv

preprocessor

tensorflow

tensorflow-datasets