TensorFlow: How to use 'tf.data' instead of 'load_csv_without_header'?

Question

Two years ago I wrote some code in TensorFlow, and as part of the data loading I used the function 'load_csv_without_header'. Now, when I run the code, I get this message:

WARNING:tensorflow:From C:\Users\Roi\Desktop\Code_Win_Ver\code_files\Tensor_Flow\version1\build_database_tuple.py:124: load_csv_without_header (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data instead.

How can I use 'tf.data' in place of the current function? With no CSV header, how can I get the same data types in the same format using tf.data? I am using TF version 1.8.0 on Python 3.5.

Thanks for your help!

Answer 1

You can use tf.TextLineReader, which has an option to skip header lines:

reader = tf.TextLineReader(skip_header_lines=1)

Answer 2

Using `tf.data` to work with `csv` files:

From TensorFlow's official documentation:

The tf.data module contains a collection of classes that allows you to easily load data, manipulate it, and pipe it into your model.

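With this API, tf.data.Dataset is intended to be the new standard for interfacing with data in TensorFlow. It represents "a sequence of elements, in which each element contains one or more Tensor objects". For a CSV, an element is simply one row of training data, represented as a pair of tensors corresponding to the data (our x) and the label (the "target") respectively.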

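With this API, the primary way to extract each row (or, more precisely, each element) from a TensorFlow dataset (tf.data.Dataset) is to use an iterator, and TensorFlow provides an API for this named tf.data.Iterator. To return the next row, we can call get_next() on the iterator.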
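As a rough illustration of the element-and-iterator idea described above, here is a minimal sketch. It assumes a TensorFlow release where eager execution is on by default (2.x), so a Dataset can be walked with Python's iter() and next(). In 1.x graph mode (such as the 1.8.0 in the question) you would instead call make_one_shot_iterator().get_next() and evaluate the result in a session. The toy values are made up for the example.

```python
import tensorflow as tf

# Toy data (made up): two rows, two feature columns, one integer label per row.
features = {"SepalLength": [5.1, 4.9], "SepalWidth": [3.5, 3.0]}
labels = [0, 1]

# Each element of the dataset is one (features, label) pair of tensors.
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

# Assumption: eager execution (TF 2.x), where a dataset is a Python iterable.
iterator = iter(dataset)
first_features, first_label = next(iterator)

print(sorted(first_features.keys()))
print(int(first_label.numpy()))
```

Each call to next() yields the following row until the dataset is exhausted.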

Now on to the code that takes a CSV and turns it into a TensorFlow dataset.

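Method 1: `tf.data.TextLineDataset()` and `tf.decode_csv()`

With more recent versions of the TensorFlow Estimator API, instead of load_csv_without_header you read your CSV with the more general tf.data.TextLineDataset(you_train_path). If there is a header row, you can chain it with skip() to skip the first line, but in your case that is not necessary.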

You can then use tf.decode_csv() to decode each line of the CSV into its respective fields.
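To make the decoding step concrete without running TensorFlow, here is a plain-Python analogue. It is not the TensorFlow op, only a sketch of the same idea: one default per column fixes that column's type and stands in for an empty field. The function name parse_line and the sample row are assumptions made for this illustration.

```python
import csv

COLUMNS = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "label"]
# One default per column; its Python type plays the role of the dtype.
FIELD_DEFAULTS = [0.0, 0.0, 0.0, 0.0, 0]


def parse_line(line):
    """Plain-Python analogue of the decode step (not the TensorFlow op)."""
    raw_fields = next(csv.reader([line]))
    # Convert each field to its column type, or use the default if it is empty.
    fields = [
        type(default)(raw) if raw != "" else default
        for raw, default in zip(raw_fields, FIELD_DEFAULTS)
    ]
    # Pack the fields into a dictionary, then split off the label.
    features = dict(zip(COLUMNS, fields))
    label = features.pop("label")
    return features, label


features, label = parse_line("6.4,2.8,5.6,2.2,2")
print(features, label)
```

The TensorFlow version does the same thing per element, except that each field comes back as a scalar tensor instead of a Python number.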

Code solution:

import tensorflow as tf
train_path = 'data_input/iris_training.csv'
# if no header, remove .skip()
trainset = tf.data.TextLineDataset(train_path).skip(1)

# Metadata describing the text columns
COLUMNS = ['SepalLength', 'SepalWidth',
           'PetalLength', 'PetalWidth',
           'label']
FIELD_DEFAULTS = [[0.0], [0.0], [0.0], [0.0], [0]]
def _parse_line(line):
    # Decode the line into its fields
    fields = tf.decode_csv(line, FIELD_DEFAULTS)

    # Pack the result into a dictionary
    features = dict(zip(COLUMNS,fields))

    # Separate the label from the features
    label = features.pop('label')

    return features, label

trainset = trainset.map(_parse_line)
print(trainset)

You will get:

<MapDataset shapes: ({
    SepalLength: (), 
    SepalWidth: (), 
    PetalLength: (), 
    PetalWidth: ()}, ()), 
types: ({
    SepalLength: tf.float32, 
    SepalWidth: tf.float32, 
    PetalLength: tf.float32, 
    PetalWidth: tf.float32}, tf.int32)>

You can verify the output classes:

({'PetalLength': tensorflow.python.framework.ops.Tensor,
  'PetalWidth': tensorflow.python.framework.ops.Tensor,
  'SepalLength': tensorflow.python.framework.ops.Tensor,
  'SepalWidth': tensorflow.python.framework.ops.Tensor},
 tensorflow.python.framework.ops.Tensor)

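You can also step through the dataset with an iterator. The output below comes from eager execution; in graph mode, call get_next() on the iterator and evaluate the result in a session:

x = trainset.make_one_shot_iterator()
x.next()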
# Output:
({'PetalLength': <tf.Tensor: id=165, shape=(), dtype=float32, numpy=1.3>,
  'PetalWidth': <tf.Tensor: id=166, shape=(), dtype=float32, numpy=0.2>,
  'SepalLength': <tf.Tensor: id=167, shape=(), dtype=float32, numpy=4.4>,
  'SepalWidth': <tf.Tensor: id=168, shape=(), dtype=float32, numpy=3.2>},
 <tf.Tensor: id=169, shape=(), dtype=int32, numpy=0>)

Method 2: `from_tensor_slices()` to build a dataset object from numpy or pandas

train, test = tf.keras.datasets.mnist.load_data()
mnist_x, mnist_y = train

mnist_ds = tf.data.Dataset.from_tensor_slices(mnist_x)
print(mnist_ds)
# returns: <TensorSliceDataset shapes: (28,28), types: tf.uint8>

Another (more detailed) example:

import numpy as np
import pandas as pd
import tensorflow as tf

california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")
# Define the input feature: total_rooms
my_feature = california_housing_dataframe[["total_rooms"]]

# Configure a numeric feature column for total_rooms
feature_columns = [tf.feature_column.numeric_column("total_rooms")]

# Define the label
targets = california_housing_dataframe["median_house_value"]

# Convert the pandas data into a dict of NumPy arrays.
features = {key: np.array(value) for key, value in dict(my_feature).items()}

# Construct a dataset from the (features, targets) pair.
ds = tf.data.Dataset.from_tensor_slices((features, targets))
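Since the question asks for the same format and data types as before, one option is to read the header-less CSV into NumPy arrays first and only then hand them to from_tensor_slices(). The sketch below is an approximation of what the deprecated loader did, as far as I understand it, not the library's code: the names Dataset and load_csv_no_header, and the two sample rows, are made up for this example.

```python
import collections
import csv
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the (data, target) container of the old loader.
Dataset = collections.namedtuple("Dataset", ["data", "target"])


def load_csv_no_header(filename, target_dtype, features_dtype, target_column=-1):
    """Read a header-less CSV into (data, target) NumPy arrays."""
    data, target = [], []
    with open(filename, newline="") as csv_file:
        for row in csv.reader(csv_file):
            if not row:  # skip blank lines
                continue
            # Remove the label column; the remaining fields are the features.
            target.append(row.pop(target_column))
            data.append(np.asarray(row, dtype=features_dtype))
    return Dataset(data=np.array(data), target=np.array(target, dtype=target_dtype))


# Write a tiny sample file (made-up rows) so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("6.4,2.8,5.6,2.2,2\n5.0,2.3,3.3,1.0,1\n")
    csv_path = tmp.name

training_set = load_csv_no_header(
    csv_path, target_dtype=np.int32, features_dtype=np.float32)
os.remove(csv_path)

print(training_set.data.shape, training_set.target)
```

The resulting arrays could then be passed on as tf.data.Dataset.from_tensor_slices((training_set.data, training_set.target)), in the same way as the examples in this method.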

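I also highly recommend this article and this one, both from the official documentation; it is safe to say they should cover most, if not all, use cases and will help you migrate away from the deprecated load_csv_without_header() function.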
