What does TensorFlow's flat_map + window.batch() do to a dataset/array?

Question

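I'm following an online course on time series forecasting with TensorFlow. The function that converts a NumPy array (the time series) into a TensorFlow dataset for an LSTM-based model was already provided (the comment lines are mine):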

def windowed_dataset(series, window_size, batch_size, shuffle_buffer):
     # creating a tensor from an array
     dataset = tf.data.Dataset.from_tensor_slices(series)
     # cutting the tensor into fixed-size windows
     dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)  
     # joining windows into a batch?
     dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))
     # separating row into features/label
     dataset = dataset.shuffle(shuffle_buffer).map(lambda window: (window[:-1], window[-1]))
     dataset = dataset.batch(batch_size).prefetch(1)
     return dataset

This code works fine, but I want to understand it better so that I can modify/adapt it for my needs.

If I remove the dataset.flat_map(lambda window: window.batch(window_size + 1)) operation, I get TypeError: '_VariantDataset' object is not subscriptable, pointing to the line: lambda window: (window[:-1], window[-1])

I managed to rewrite part of the code (skipping the shuffling) as NumPy-based code:

import numpy as np
import tensorflow as tf
from numpy.lib.stride_tricks import sliding_window_view

def windowed_dataset_np(series, window_size):
    values = sliding_window_view(series, window_size)
    X = values[:, :-1]
    X = tf.convert_to_tensor(np.expand_dims(X, axis=-1))
    y = values[:,-1]
    return X, y

The syntax for fitting the model looks a bit different, but it works fine.

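My two questions are:

What does dataset.flat_map(lambda window: window.batch(window_size + 1)) achieve?
Is the second piece of code really equivalent to the first three operations in the original function?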

Answer 1

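I would break the operations down into smaller parts to really understand what is happening, since applying window to a dataset actually creates a dataset of windowed datasets, each containing a sequence of tensors: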

import tensorflow as tf

window_size = 2
dataset = tf.data.Dataset.range(7)
dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)  

for i, window in enumerate(dataset):
  print('{}. windowed dataset'.format(i + 1))
  for w in window:
    print(w)

1. windowed dataset
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
2. windowed dataset
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
3. windowed dataset
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
4. windowed dataset
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
5. windowed dataset
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
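
This nesting is also what explains the TypeError in the question. As a minimal check, assuming TensorFlow 2.x with eager execution, the sketch below confirms that each element produced by window is itself a tf.data.Dataset and not a tensor, so it cannot be sliced with window[:-1]. The variable names are only illustrative.

```python
import tensorflow as tf

window_size = 2
windowed = tf.data.Dataset.range(7).window(
    window_size + 1, shift=1, drop_remainder=True
)

# Take the first element of the windowed dataset.
first_window = next(iter(windowed))

# It is a nested Dataset, not a Tensor, which is why indexing it fails.
is_nested_dataset = isinstance(first_window, tf.data.Dataset)

# Its contents only become visible by iterating over it.
window_values = [int(x.numpy()) for x in first_window]
print(is_nested_dataset, window_values)
```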

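Notice how the window is always shifted by one position because of the parameter shift=1. Now, flat_map is used here to flatten the dataset of datasets into a dataset of elements; however, you still want to keep the windowed sequences you created, so you batch each window with dataset.batch, using the window size as the batch size: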

dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))
for w in dataset:
  print(w)

tf.Tensor([0 1 2], shape=(3,), dtype=int64)
tf.Tensor([1 2 3], shape=(3,), dtype=int64)
tf.Tensor([2 3 4], shape=(3,), dtype=int64)
tf.Tensor([3 4 5], shape=(3,), dtype=int64)
tf.Tensor([4 5 6], shape=(3,), dtype=int64)
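
To make the mechanics concrete, here is a rough pure-Python analogue that uses plain lists instead of datasets. The helpers window, batch and flat_map are hypothetical stand-ins written for this illustration, not TensorFlow functions, and they only mimic the behaviour described above.

```python
from itertools import chain

def window(seq, size, shift=1):
    """Return overlapping sub-lists, mimicking Dataset.window with drop_remainder=True."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, shift)]

def batch(seq, size):
    """Group consecutive elements into chunks, mimicking Dataset.batch."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def flat_map(nested, fn):
    """Apply fn to every inner sequence and concatenate the results."""
    return list(chain.from_iterable(fn(inner) for inner in nested))

windows = window(list(range(7)), 3)

# Batching inside flat_map turns each window into a single element.
result = flat_map(windows, lambda w: batch(w, 3))
print(result)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
```

Because each window holds exactly three elements, batch(w, 3) yields one chunk per window, and flat_map then concatenates those single-chunk results into one flat sequence of windows.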

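If you want to create the windowed sequences, you could also first flatten the dataset of datasets and then apply batch. Starting again from the windowed dataset, this gives the same result because drop_remainder=True guarantees that every window has exactly window_size + 1 elements: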

dataset = dataset.flat_map(lambda window: window).batch(window_size + 1)
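
A quick way to convince yourself that the two variants agree is to build both from fresh windowed datasets and compare the rows. This is a sketch assuming TensorFlow 2.x eager execution; make_windows is a hypothetical helper introduced only for this comparison.

```python
import tensorflow as tf

window_size = 2

def make_windows():
    # Hypothetical helper: rebuild the windowed dataset from scratch each time.
    return tf.data.Dataset.range(7).window(
        window_size + 1, shift=1, drop_remainder=True
    )

# Variant 1: batch each window inside flat_map.
batch_inside = make_windows().flat_map(lambda w: w.batch(window_size + 1))

# Variant 2: flatten to scalars first, then batch the flat stream.
batch_outside = make_windows().flat_map(lambda w: w).batch(window_size + 1)

rows_inside = [row.numpy().tolist() for row in batch_inside]
rows_outside = [row.numpy().tolist() for row in batch_outside]
print(rows_inside == rows_outside)
```

The agreement depends on drop_remainder=True in window. Without it, the trailing windows are shorter, and batching the flattened stream would mix elements from neighbouring windows.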

Or only flatten the dataset of datasets:

dataset = dataset.flat_map(lambda window: window)
for w in dataset:
  print(w)

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)

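But that is probably not what you want. As for this line in your question: dataset = dataset.shuffle(shuffle_buffer).map(lambda window: (window[:-1], window[-1])), it is fairly straightforward. It shuffles the windows and then splits each one into a sequence and a label, using the last element of each window as the label (the output order varies between runs because of the shuffle):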

dataset = dataset.shuffle(2).map(lambda window: (window[:-1], window[-1]))
for w in dataset:
  print(w)

(<tf.Tensor: shape=(2,), dtype=int64, numpy=array([1, 2])>, <tf.Tensor: shape=(), dtype=int64, numpy=3>)
(<tf.Tensor: shape=(2,), dtype=int64, numpy=array([2, 3])>, <tf.Tensor: shape=(), dtype=int64, numpy=4>)
(<tf.Tensor: shape=(2,), dtype=int64, numpy=array([3, 4])>, <tf.Tensor: shape=(), dtype=int64, numpy=5>)
(<tf.Tensor: shape=(2,), dtype=int64, numpy=array([4, 5])>, <tf.Tensor: shape=(), dtype=int64, numpy=6>)
(<tf.Tensor: shape=(2,), dtype=int64, numpy=array([0, 1])>, <tf.Tensor: shape=(), dtype=int64, numpy=2>)
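
On the second question, one way to check the NumPy version against the first three tf.data operations is to compare the windows directly. The sketch below assumes NumPy 1.20 or later (for sliding_window_view) and TensorFlow 2.x, and uses a toy series. Note that it passes window_size + 1 to sliding_window_view. With window_size, as in the question's function, each window is one element shorter, so X ends up with window_size - 1 columns instead of window_size.

```python
import numpy as np
import tensorflow as tf
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(7)  # toy data standing in for the real time series
window_size = 2

# First three operations of the original function.
dataset = tf.data.Dataset.from_tensor_slices(series)
dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda w: w.batch(window_size + 1))
tf_windows = np.stack([w.numpy() for w in dataset])

# NumPy counterpart: the window length must also be window_size + 1.
np_windows = sliding_window_view(series, window_size + 1)
same = np.array_equal(tf_windows, np_windows)

# Features are all but the last element of each window; the label is the last one.
X, y = np_windows[:, :-1], np_windows[:, -1]
print(same, X.shape, y.shape)
```

Under these assumptions the windows match row for row. The remaining differences from the original function are the shuffling, batching and prefetching, which the NumPy version leaves to the model-fitting call.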
