Is a TensorFlow Dataset operation equivalent to timeseries_dataset_from_array possible?

Question

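I want more control over how my TensorFlow dataset is generated. For this reason, I want to mirror the behaviour of timeseries_dataset_from_array, but with the ability to use consecutive or non-overlapping windows (it is not possible to set sequence_stride=0 with timeseries_dataset_from_array). I would expect: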

# Shapes: df_with_inputs is (x, 19), df_with_labels is (x, 1)
ds = tf.data.Dataset.from_tensor_slices((df_with_inputs.values, df_with_labels.values)).window(20, shift=1, stride=1, drop_remainder=True).batch(32)

to be equivalent to:

ds = tf.keras.preprocessing.timeseries_dataset_from_array(df_with_inputs[df_with_inputs.columns], df_with_labels[df_with_labels.columns], sequence_length=window_size,sequence_stride=1,shuffle=False,batch_size=batch_size)

Both create a BatchDataset with the same number of samples, but the type spec of the dataset built with the manual method is different. The first one gives me:

<BatchDataset shapes: (DatasetSpec(TensorSpec(shape=(19,), dtype=tf.float32, name=None), TensorShape([None])), DatasetSpec(TensorSpec(shape=(1,), dtype=tf.float32, name=None), TensorShape([None]))), types: (DatasetSpec(TensorSpec(shape=(19,), dtype=tf.float32, name=None), TensorShape([None])), DatasetSpec(TensorSpec(shape=(1,), dtype=tf.float32, name=None), TensorShape([None])))>

whereas the latter gives me:

<BatchDataset shapes: ((None, None, 19), (None, 1)), types: (tf.float64, tf.int32)>

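Both contain the same number of elements, though, 3063 in my case. Note that stride and sequence_stride behave differently in the two methods (to get the same behaviour you need shift=1). Additionally, when I try to feed the first dataset to my NN, I get the following error (whereas the ds from timeseries_dataset_from_array works just fine):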

TypeError: Inputs to a layer should be tensors.

Any idea what I am missing here?

My model:

input_shape = (window_size, num_features) #(20,19)
model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(filters=64, kernel_size=3, activation='relu', padding="same",
                           input_shape=input_shape), [....]])

Answer 1

The equivalent of:

import tensorflow as tf

tf.random.set_seed(345)
samples = 30
df_with_inputs = tf.random.normal((samples, 2), dtype=tf.float32)
df_with_labels = tf.random.uniform((samples, 1), maxval=2, dtype=tf.int32)
batch_size = 2
window_size = 20
ds1 = tf.keras.preprocessing.timeseries_dataset_from_array(df_with_inputs, df_with_labels, sequence_length=window_size,sequence_stride=1,shuffle=False, batch_size=batch_size)
for x, y in ds1.take(1):
  print(x, y)

tf.Tensor(
[[[-0.01898661  1.2348452 ]
  [-0.33379436 -0.13637085]
  [-2.239644    1.5407541 ]
  [-0.14988706  0.50577176]
  [-1.6328571  -0.9512018 ]
  [-3.0481005   0.8019097 ]
  [-0.683125   -0.12166552]
  [-0.5408724  -0.97584397]
  [ 0.47595206  1.0512688 ]
  [ 0.15297593  0.7393363 ]
  [-0.17052855 -0.12541457]
  [ 1.1617764  -2.491248  ]
  [-2.5665069   0.9241422 ]
  [ 0.40681016 -1.031384  ]
  [-0.23945935  1.5275828 ]
  [-1.3431666   0.2940185 ]
  [ 1.7351524   0.34276873]
  [ 0.8059861   2.0647929 ]
  [-0.3017126   0.729208  ]
  [-0.8672192  -0.79938954]]

 [[-0.33379436 -0.13637085]
  [-2.239644    1.5407541 ]
  [-0.14988706  0.50577176]
  [-1.6328571  -0.9512018 ]
  [-3.0481005   0.8019097 ]
  [-0.683125   -0.12166552]
  [-0.5408724  -0.97584397]
  [ 0.47595206  1.0512688 ]
  [ 0.15297593  0.7393363 ]
  [-0.17052855 -0.12541457]
  [ 1.1617764  -2.491248  ]
  [-2.5665069   0.9241422 ]
  [ 0.40681016 -1.031384  ]
  [-0.23945935  1.5275828 ]
  [-1.3431666   0.2940185 ]
  [ 1.7351524   0.34276873]
  [ 0.8059861   2.0647929 ]
  [-0.3017126   0.729208  ]
  [-0.8672192  -0.79938954]
  [-0.14423785  0.95039433]]], shape=(2, 20, 2), dtype=float32) tf.Tensor(
[[1]
 [1]], shape=(2, 1), dtype=int32)

using tf.data.Dataset.from_tensor_slices would be:

ds2 = tf.data.Dataset.from_tensor_slices((df_with_inputs, df_with_labels)).batch(batch_size)
inputs_only_ds = ds2.map(lambda x, y: x)
inputs_only_ds = inputs_only_ds.flat_map(tf.data.Dataset.from_tensor_slices).window(window_size, shift=1, stride=1, drop_remainder=True).flat_map(lambda x: x.batch(window_size)).batch(batch_size)
ds2 = tf.data.Dataset.zip((inputs_only_ds, ds2.map(lambda x, y: y)))
for x, y in ds2.take(1):
  print(x, y)

tf.Tensor(
[[[-0.01898661  1.2348452 ]
  [-0.33379436 -0.13637085]
  [-2.239644    1.5407541 ]
  [-0.14988706  0.50577176]
  [-1.6328571  -0.9512018 ]
  [-3.0481005   0.8019097 ]
  [-0.683125   -0.12166552]
  [-0.5408724  -0.97584397]
  [ 0.47595206  1.0512688 ]
  [ 0.15297593  0.7393363 ]
  [-0.17052855 -0.12541457]
  [ 1.1617764  -2.491248  ]
  [-2.5665069   0.9241422 ]
  [ 0.40681016 -1.031384  ]
  [-0.23945935  1.5275828 ]
  [-1.3431666   0.2940185 ]
  [ 1.7351524   0.34276873]
  [ 0.8059861   2.0647929 ]
  [-0.3017126   0.729208  ]
  [-0.8672192  -0.79938954]]

 [[-0.33379436 -0.13637085]
  [-2.239644    1.5407541 ]
  [-0.14988706  0.50577176]
  [-1.6328571  -0.9512018 ]
  [-3.0481005   0.8019097 ]
  [-0.683125   -0.12166552]
  [-0.5408724  -0.97584397]
  [ 0.47595206  1.0512688 ]
  [ 0.15297593  0.7393363 ]
  [-0.17052855 -0.12541457]
  [ 1.1617764  -2.491248  ]
  [-2.5665069   0.9241422 ]
  [ 0.40681016 -1.031384  ]
  [-0.23945935  1.5275828 ]
  [-1.3431666   0.2940185 ]
  [ 1.7351524   0.34276873]
  [ 0.8059861   2.0647929 ]
  [-0.3017126   0.729208  ]
  [-0.8672192  -0.79938954]
  [-0.14423785  0.95039433]]], shape=(2, 20, 2), dtype=float32) tf.Tensor(
[[1]
 [1]], shape=(2, 1), dtype=int32)

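Note that flat_map is needed to flatten the tensors so that the sliding windows can be applied more easily. The function flat_map(lambda x: x.batch(window_size)) simply creates batches of the flattened tensors after the sliding windows have been applied.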
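To make the role of window and flat_map concrete, here is a minimal sketch on a toy range dataset (the toy data and the variable names are assumptions chosen for illustration, not part of the original code). It also shows why the dataset in the question cannot be fed to a layer directly: window on its own yields nested datasets rather than tensors.

```python
import tensorflow as tf

# Toy series 0..5 so the window contents are easy to read (illustrative data).
ds = tf.data.Dataset.range(6)

# window() yields a dataset of *datasets*: one nested dataset per window.
windows = ds.window(3, shift=1, stride=1, drop_remainder=True)
print(windows.element_spec)  # a DatasetSpec, not a TensorSpec

# flat_map + batch(window size) turns each nested dataset into a single tensor.
dense = windows.flat_map(lambda w: w.batch(3))

result = [w.numpy().tolist() for w in dense]
print(result)
```

Each element of dense is now an ordinary tensor of shape (3,), which is the form a Keras layer can consume once it is batched.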

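With the line inputs_only_ds = ds2.map(lambda x, y: x), I extract only the data (x), without the labels (y), in order to run the sliding windows. Afterwards, in tf.data.Dataset.zip((inputs_only_ds, ds2.map(lambda x, y: y))), I zip the sliding-window dataset with the labels (y) to get the final result, ds2.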
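One thing to watch with this zip: the labels are batched independently of the windows, so when the number of windows is not a multiple of batch_size, the last element can pair fewer windows with a full batch of labels. A possible variation, sketched below under the assumption that each window should be labelled with the target at its first row, is to zip windows and labels one by one and batch only at the end. The helper name windowed_dataset and its shift parameter are hypothetical; setting shift=window_size gives the non-overlapping windows the question asks about.

```python
import tensorflow as tf

def windowed_dataset(inputs, labels, window_size, shift, batch_size):
    """Hypothetical helper: pair each window with the label at its first row."""
    x = tf.data.Dataset.from_tensor_slices(inputs)
    x = x.window(window_size, shift=shift, drop_remainder=True)
    x = x.flat_map(lambda w: w.batch(window_size))
    # Window i starts at row i * shift, so keep every shift-th label.
    y = tf.data.Dataset.from_tensor_slices(labels[::shift])
    # zip stops at the shorter dataset, so surplus labels are dropped.
    return tf.data.Dataset.zip((x, y)).batch(batch_size)

inputs = tf.random.normal((30, 2))
labels = tf.random.uniform((30, 1), maxval=2, dtype=tf.int32)

# Overlapping windows (shift=1): 11 windows, so the last batch holds one pair.
overlapping = windowed_dataset(inputs, labels, window_size=20, shift=1, batch_size=2)
shapes = [(x.shape.as_list(), y.shape.as_list()) for x, y in overlapping]
print(shapes[0], shapes[-1])

# Non-overlapping windows: shift equals the window size.
non_overlapping = windowed_dataset(inputs, labels, window_size=5, shift=5, batch_size=2)
first_x, first_y = next(iter(non_overlapping))
print(first_x.shape, first_y.shape)
```

Because batching happens after the zip, inputs and labels always have the same leading dimension, including in a final partial batch.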
