How to split the training data and test data for LSTM for time series prediction in TensorFlow

Question

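I recently learned about LSTMs for time series prediction from https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/23_Time-Series-Prediction.ipynb

In his tutorial, he says: instead of training the recurrent neural network on the complete sequence of almost 300k observations, we will use the following function to create a batch of shorter sub-sequences picked at random from the training data.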

def batch_generator(batch_size, sequence_length):
    """
    Generator function for creating random batches of training-data.
    """

    # Infinite loop.
    while True:
        # Allocate a new array for the batch of input-signals.
        x_shape = (batch_size, sequence_length, num_x_signals)
        x_batch = np.zeros(shape=x_shape, dtype=np.float16)

        # Allocate a new array for the batch of output-signals.
        y_shape = (batch_size, sequence_length, num_y_signals)
        y_batch = np.zeros(shape=y_shape, dtype=np.float16)

        # Fill the batch with random sequences of data.
        for i in range(batch_size):
            # Get a random start-index.
            # This points somewhere into the training-data.
            idx = np.random.randint(num_train - sequence_length)

            # Copy the sequences of data starting at this index.
            x_batch[i] = x_train_scaled[idx:idx+sequence_length]
            y_batch[i] = y_train_scaled[idx:idx+sequence_length]

        yield (x_batch, y_batch)

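He creates several batches of samples for training.

My question is: can we first randomly shuffle x_train_scaled and y_train_scaled, and then start drawing batches with the batch_generator above?

My motivation for asking is that in time series prediction we want to train on the past and predict the future. So is it legitimate to shuffle the training samples?

In the tutorial, the author picks a contiguous chunk of samples, such as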

x_batch[i] = x_train_scaled[idx:idx+sequence_length]
y_batch[i] = y_train_scaled[idx:idx+sequence_length]

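Can we choose x_batch and y_batch to be non-contiguous? For example, x_batch[0] is taken at 10:00am and x_batch[1] is taken at 9:00am on the same day?

In summary, my two questions are:

(1) Can we first randomly shuffle x_train_scaled and y_train_scaled, and then start drawing batches with the batch_generator above?

(2) When we train an LSTM, do we need to take temporal order into account? What parameters does the LSTM learn?

Thanks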

Answer 1

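It depends a lot on the dataset. For example, the weather on any given day in the dataset is highly correlated with the weather on the surrounding days. So, in this case, you should try a stateful LSTM (i.e., an LSTM that uses the previous records as input to the next one) and train in order.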
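To make "train in order" concrete, here is a minimal sketch of how batches could be laid out for a stateful recurrent network. It is not taken from the tutorial: `ordered_batches` is a hypothetical helper, and the toy `data` array stands in for something like x_train_scaled. The idea is to cut the series into batch_size parallel streams so that row i of each batch continues exactly where row i of the previous batch stopped.

```python
import numpy as np

def ordered_batches(data, batch_size, sequence_length):
    """Yield batches in time order.

    Row i of every batch is the direct continuation of row i of the
    previous batch, which is the layout a stateful RNN relies on.
    `data` has shape (num_samples, num_signals).
    """
    # Cut the series into batch_size parallel streams of equal length.
    stream_len = len(data) // batch_size
    # Keep only whole windows in each stream; leftovers are dropped.
    n_steps = stream_len // sequence_length
    streams = data[:batch_size * stream_len].reshape(
        batch_size, stream_len, *data.shape[1:])
    for step in range(n_steps):
        start = step * sequence_length
        yield streams[:, start:start + sequence_length]

# Toy series (assumption): 40 time steps, 1 signal, value == time index.
data = np.arange(40, dtype=np.float32).reshape(40, 1)
batches = list(ordered_batches(data, batch_size=2, sequence_length=5))
```

With Keras, a generator like this would typically be paired with an LSTM layer created with `stateful=True` and with shuffling disabled during fitting; that part is left out here because it depends on the model.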

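However, if your records (or a transformation of them) are independent of each other but depend on some notion of time, such as the inter-arrival time of the items in a record or in a subset of these records, there should be noticeable differences when using shuffling. In some cases it will improve the robustness of the model; in other cases the model will not generalize. Noticing these differences is part of evaluating the model.

In the end, the question is: is the "time series" really a time series as it stands (i.e., records really do depend on their neighbors), or is there some transformation that can break this dependency while preserving the structure of the problem? And for this question there is only one way to get an answer: explore the dataset.

As for authoritative references, I will have to let you down. I learned this from a seasoned researcher in the field; according to him, however, he learned it through a lot of experiments and failures. As he told me: these aren't rules, they are guidelines; try all the solutions that fit your budget; improve on the best ones; try again.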

Answer 2

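(1) No, we cannot. Imagine trying to predict tomorrow's weather. Would you want a sequence of temperature values from the past 10 hours, or random temperature values from the past 5 years?

Your dataset is one long sequence of values at 1-hour intervals. Your LSTM takes in a sequence of samples that are chronologically connected. For example, with sequence_length = 10 it can take the data from 2018-03-01 09:00:00 up to (but not including) 2018-03-01 19:00:00 as input. If you shuffle the dataset before generating the batches made of these sequences, you will train your LSTM to predict from sequences of random samples drawn from the whole dataset.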
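The difference between shuffling rows and picking random start positions can be checked with a small sketch. Everything here is illustrative: `series` is a toy stand-in for x_train_scaled in which each value is its own time index, and `sample_window` and `is_chronological` are hypothetical helpers. Random start indices (what batch_generator already does) give windows that are each internally ordered, while shuffling the rows first leaves no ordered windows to draw.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in (assumption): 100 hourly observations, value == time index.
series = np.arange(100)
sequence_length = 10

def sample_window(data, rng, sequence_length):
    """Pick a random start index and return one contiguous window."""
    idx = rng.integers(len(data) - sequence_length)
    return data[idx:idx + sequence_length]

def is_chronological(window):
    """True when consecutive samples are exactly one time step apart."""
    return bool(np.all(np.diff(window) == 1))

# Random start positions: windows come from anywhere in the training
# data, but the samples inside each window stay in time order.
good = [sample_window(series, rng, sequence_length) for _ in range(5)]

# Rows shuffled first: each window is a jumble of unrelated time steps.
shuffled = rng.permutation(series)
bad = [sample_window(shuffled, rng, sequence_length) for _ in range(5)]

print([is_chronological(w) for w in good])
print([is_chronological(w) for w in bad])
```

So the order in which windows are placed into a batch can be random, as long as each window itself is a contiguous slice of the original series.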

(2) Yes, we need to take the temporal order of the time series into account. You can find ways to test your time series LSTM in Python here: https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/

The train/test data must be split in such a way as to respect the temporal ordering and the model is never trained on data from the future and only tested on data from the future.
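As a rough illustration of a split that respects temporal ordering, the sketch below works purely on index ranges. The function names `chronological_split` and `walk_forward_splits`, the 80/20 ratio, and the fold sizes are all assumptions made for the example, not something prescribed by the tutorial or the linked article.

```python
def chronological_split(n_samples, train_fraction=0.8):
    """Return (train, test) index ranges; every test index comes
    after every train index."""
    split = int(n_samples * train_fraction)
    return range(0, split), range(split, n_samples)

def walk_forward_splits(n_samples, n_folds, test_size):
    """Yield expanding-window (train, test) index ranges, oldest fold
    first. Each test block directly follows its training block."""
    for fold in range(n_folds):
        test_end = n_samples - (n_folds - 1 - fold) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough samples for this many folds")
        yield range(0, test_start), range(test_start, test_end)

# Single hold-out split: first 80 samples for training, last 20 for testing.
train_idx, test_idx = chronological_split(100, train_fraction=0.8)

# Three backtest folds, each tested on the 10 samples after its train set.
folds = list(walk_forward_splits(100, n_folds=3, test_size=10))
for train, test in folds:
    print(f"train 0..{train[-1]}  test {test[0]}..{test[-1]}")
```

Random windows for training, as in batch_generator, would then be drawn only from the training range, so that no window reaches into the test period.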

time-series

python-3.x

cross-validation

lstm

tensorflow