How to feed TensorFlow Datasets into training_x, training_y, testing_x, testing_y for the Keras API?

Question

TensorFlow Datasets is a convenient tool for using datasets from the internet. However, I am confused about how to feed one into the Input layer of the TensorFlow Keras API. The dataset I am using is emnist from TensorFlow Datasets.

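Here is what I know so far:

Point 1: Instead of storing the whole dataset in memory, TensorFlow Datasets, by way of the tf.data module, preprocesses the dataset on disk and uses a pipeline (an instance of some class-like object?) to feed the data to Python code. This is done with the load function.

Issue 1, "as_supervised": However, there are two "different" ways to load, depending on whether as_supervised is turned on: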

train_ds = tfds.load('mnist', split='train', as_supervised=True,shuffle_files=True)
ds = tfds.load('mnist', split='train', shuffle_files=True)

In the tfds.load documentation, this keyword is described as

bool, if True, the returned tf.data.Dataset will have a 2-tuple structure (input, label) according to builder.info.supervised_keys. If False, the default, the returned tf.data.Dataset will have a dictionary with all the features.

ds=ds.take(5)
for example in ds:  # example is `{'image': tf.Tensor, 'label': tf.Tensor}`
    image = example["image"]
    label = example["label"]
    print(type(image))
    print(image.shape,"label", label)
    plt.imshow(image) #np.array(inputs).reshape(28,28) maybe needed based on the compiler
    plt.show()

and

for inputs,targets in train_ds:
  print(type(inputs))
  print(inputs.shape,"label=",targets)
  plt.imshow(inputs)
  plt.show()

both return

<class 'tensorflow.python.framework.ops.EagerTensor'>
(28, 28, 1) label tf.Tensor(2, shape=(), dtype=int64)

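a. So what is the point of the as_supervised argument here, given that the data is labeled through ds_info.supervised_keys (tfds.core.DatasetInfo) anyway?

Issue 2: The splits and slicing documentation lists many commands for dividing the data into training and test sets. Unfortunately, most of them did not work for me (Anaconda build with the latest modules). For example, the code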

# The full `train` split and the full `test` split as two distinct datasets.
train_ds, test_ds = tfds.load('mnist', split=['train', 'test'])

produces the warning

WARNING:absl:Warning: Setting shuffle_files=True because split=TRAIN and shuffle_files=None. This behavior will be deprecated on 2019-08-06, at which point shuffle_files=False will be the default for all splits.

which should not be a problem. However,

# The full `train` and `test` splits, interleaved together.
train_test_ds = tfds.load('mnist', split='train+test')

# From record 10 (included) to record 20 (excluded) of `train` split.
train_10_20_ds = tfds.load('mnist', split='train[10:20]')

# The first 10% of train split.
train_10pct_ds = tfds.load('mnist', split='train[:10%]')

# The first 10% of train + the last 80% of train.
train_10_80pct_ds = tfds.load('mnist', split='train[:10%]+train[-80%:]')

all fail with an error message of the form

"Invalid split train+test. Available splits are: ['test', 'train']"

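These two splits are the ones defined in ds_info. But this means I can no longer split 'train' into a training set and a test set, or restrict the training set to something smaller, such as 10% of the whole 'train' split.

c. Suppose only the 'train' split of emnist is loaded. How do I divide it into train_set and test_set? Or restrict train_set to 10% of the whole 'train' split?

Issue 3: train_ds is a pipeline. With or without as_supervised, its elements have to be accessed one by one, either as a tuple or as a dictionary: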

for inputs, target in train_ds: ...
for elem in ds: elem["image"], elem["label"]

But model.fit in TensorFlow expects inputs of the form

training_x(for input layer): N*(image size)
training_y(for the target) : N*(target size)

Neither pipeline, train_ds or ds, has that form. Compare the nice load function in keras.datasets, which automatically splits the training dataset into x_train and y_train.

(x_train,y_train),(x_test,y_test)=tf.keras.datasets.cifar10.load_data()

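I considered using a for loop to feed the data from the pipeline into the model one element at a time, but that is clearly not how it should be done. How do I feed train_ds or ds into the model.fit function?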

Answer 1

Issue 1

Regarding as_supervised, according to the documentation:

bool, if True, the returned tf.data.Dataset will have a 2-tuple structure (input, label) according to builder.info.supervised_keys. If False, the default, the returned tf.data.Dataset will have a dictionary with all the features.

import tensorflow as tf 
import tensorflow_datasets as tfds

train_ds = tfds.load('mnist', split='train', as_supervised=True,shuffle_files=True)
ds = tfds.load('mnist', split='train', shuffle_files=True)

for example in ds.take(5):  
    image = example["image"]
    label = example["label"]
    print(image.shape, label.shape)

(28, 28, 1) ()
(28, 28, 1) ()
(28, 28, 1) ()
(28, 28, 1) ()
(28, 28, 1) ()

for inputs,targets in train_ds.take(5):
    print(inputs.shape, targets.shape)

(28, 28, 1) ()
(28, 28, 1) ()
(28, 28, 1) ()
(28, 28, 1) ()
(28, 28, 1) ()

As the blog post you pointed to puts it,

as_supervised: Returns tuple (img, label) instead of dict {'image': img, 'label': label}
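To make the difference concrete, here is a minimal sketch of what as_supervised=True effectively does to each element. It uses a small synthetic dataset of dictionaries instead of downloading mnist, and the helper name to_supervised is only an illustration, not part of TFDS.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for tfds.load(..., as_supervised=False):
# every element is a dictionary of features.
images = np.zeros((4, 28, 28, 1), dtype=np.uint8)
labels = np.array([0, 1, 2, 3], dtype=np.int64)
dict_ds = tf.data.Dataset.from_tensor_slices({"image": images, "label": labels})

def to_supervised(example):
    # Hypothetical helper: pick the (input, label) pair out of the dictionary.
    return example["image"], example["label"]

# After the map, every element is a 2-tuple, like with as_supervised=True.
tuple_ds = dict_ds.map(to_supervised)

first_image, first_label = next(iter(tuple_ds))
print(first_image.shape, first_label.numpy())
```

So the data is the same either way; the argument only changes the structure of each element, and the tuple structure is the one model.fit can consume directly.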

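Issue 2

As I mentioned in the comments, I tested this on my local machine in an Anaconda environment with TF 2.4.1 and TF DS 4.2.0. Since you said you are using TF DS 1.2.0, I think you should update the package. Rather than relying on the conda package, install it with pip; you may not get the latest version from the conda installer.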
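If upgrading is not an option, one possible workaround for question c is to split an already loaded dataset with the generic tf.data methods take and skip. This is only a sketch on a synthetic dataset; the 90/10 ratio and the variable names are assumptions for illustration.

```python
import tensorflow as tf

# Synthetic stand-in for an already loaded 'train' split with 100 examples.
full_ds = tf.data.Dataset.range(100)

# With TFDS this count could come from ds_info.splits['train'].num_examples.
num_examples = 100
num_train = int(0.9 * num_examples)

# First 90% for training, remaining 10% held out for testing.
train_set = full_ds.take(num_train)
test_set = full_ds.skip(num_train)

# Restrict training to the first 10% of the whole split.
small_train_set = full_ds.take(num_examples // 10)

print(len(list(train_set)), len(list(test_set)), len(list(small_train_set)))
```

For the two parts to stay disjoint, the order of the source dataset has to be the same on every pass, so it is safer to load with shuffle_files=False and to apply any shuffle only after the take/skip split.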

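Issue 3

If I understood your query from the comments correctly, you want to know how to pass this data to the model (.fit). Here I will try both ways of loading the data and feeding it to the model.

Using as_supervised=True:

It returns a tuple of training pairs (image and label). The document you pointed to shows how to run a model with this loading method. A tf.data dataset should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights). However, I suspect there is something wrong with the model definition there. Here is working code: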

ds_train, ds_info = tfds.load(
    'mnist',
    split='train',
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)

def normalize_img(image, label):
  return tf.cast(image, tf.float32) / 255., label

ds_train = ds_train.map(normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE)

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input((28, 28, 1)),
    layers.Conv2D(32, 3, activation='relu'),
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation='softmax'),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(0.01),
    # The last layer already applies softmax, so the loss receives probabilities, not logits.
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

model.fit(
    ds_train,
    epochs=2
)

Epoch 1/2
4ms/step - loss: 2.0585 - sparse_categorical_accuracy: 0.2354
Epoch 2/2
3ms/step - loss: 1.5647 - sparse_categorical_accuracy: 0.4162
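The testing_x and testing_y part of the question can be handled the same way: a batched dataset of (image, label) tuples can be passed as validation_data to model.fit or directly to model.evaluate. The sketch below uses random synthetic data and a deliberately tiny model so that it runs without downloading anything; make_fake_split is a hypothetical helper standing in for tfds.load(..., as_supervised=True).

```python
import numpy as np
import tensorflow as tf

def make_fake_split(n):
    # Hypothetical helper: random images and labels shaped like mnist.
    images = np.random.randint(0, 256, size=(n, 28, 28, 1), dtype=np.uint8)
    labels = np.random.randint(0, 10, size=(n,), dtype=np.int64)
    return tf.data.Dataset.from_tensor_slices((images, labels))

def normalize_img(image, label):
    # Scale pixel values from [0, 255] to [0, 1].
    return tf.cast(image, tf.float32) / 255.0, label

ds_train = make_fake_split(64).map(normalize_img).batch(16)
ds_test = make_fake_split(32).map(normalize_img).batch(16)

model = tf.keras.Sequential([
    tf.keras.Input((28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# The datasets already yield (inputs, targets), so no separate x and y are needed.
history = model.fit(ds_train, validation_data=ds_test, epochs=1, verbose=0)
test_loss, test_acc = model.evaluate(ds_test, verbose=0)
```

With real data, ds_test would be built from the 'test' split with the same map and batch steps, usually without shuffling.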

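Using as_supervised=False:

This way, it returns a dictionary such as {'image': img, 'label': label}. According to the documentation, this does not seem easy to feed to a model directly, but we can use a workaround as follows: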

import numpy as np

ds = tfds.load('mnist', split='train', shuffle_files=True)

train_x = []
train_y = []

# Collect every image and label from the dictionary-style examples.
for example in ds:
  train_x.append(example["image"])
  train_y.append(example["label"])

train_x = np.array(train_x)
train_y = np.array(train_y)

print(train_x.shape, train_y.shape)
(60000, 28, 28, 1) (60000,)

With the same model definition as above, we can pass these training pairs as follows:

model.fit(
    train_x,
    train_y,
    epochs=2
)

Epoch 1/2
2ms/step - loss: 1.8551 - sparse_categorical_accuracy: 0.4013
Epoch 2/2
2ms/step - loss: 0.7965 - sparse_categorical_accuracy: 0.7381
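The Python loop above can be slow for larger datasets. A possible alternative, sketched here on a small synthetic dictionary dataset rather than on mnist itself, is to pull the whole dataset out as one batch and convert it to NumPy arrays. The size n and the variable names are assumptions for illustration, and the approach only makes sense when the dataset fits in memory.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for tfds.load(..., as_supervised=False).
n = 8
dict_ds = tf.data.Dataset.from_tensor_slices({
    "image": np.zeros((n, 28, 28, 1), dtype=np.uint8),
    "label": np.arange(n, dtype=np.int64),
})

# One batch that holds the entire dataset, so no Python loop is needed.
full_batch = next(iter(dict_ds.batch(n)))
train_x = full_batch["image"].numpy()
train_y = full_batch["label"].numpy()

print(train_x.shape, train_y.shape)
```

TFDS also documents a shortcut for the same idea: tfds.load with batch_size=-1 returns the full split as tensors, and tfds.as_numpy converts them to NumPy arrays.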

FYI, there are several ways to pass training pairs to a model. You can look at my other answer, where we discussed this in more detail.


tensorflow

tensorflow-datasets

tensorflow2.0
