How to shape a large DataFrame for Python's Keras LSTM?

Question

I found almost what I needed in an existing example, but I ran into memory problems, because the test df provided there had only 11 rows.

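What I want is to forecast data 10 days ahead in a time series using an LSTM regression model (not a classifier!). My dataframe X has about 1500 rows and 2000 features, i.e. shape (1500, 2000), while the ground truth y is just 1500 rows of 1 feature (which can take any value between -1 and 1).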

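Since an LSTM needs a 3D array as input, I am really struggling with how to reshape the data.

Again, following the example mentioned in the first paragraph, it crashes with a MemoryError when padding the values, more specifically at df.cumulative_input_vectors.tolist().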

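My test set (read: the data to forecast from) is a dataframe of shape (10, 2000).

Since the data is sensitive, I can't actually share values or an example. How can I help you help me?

So, for the LSTM to learn from the 1500 rows of y, how should I reshape my X of 1500 rows and 2000 features? And how should I reshape my forecast input of 10 rows and 2000 features?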

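They will go into a simple LSTM model, at least at first, since I am still learning LSTMs:

from keras.models import Sequential
from keras.layers import LSTM, Dense

# train_X must be 3D: (samples, timesteps, features)
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(train_X, train_y, epochs=50, batch_size=2, verbose=1)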

I have tried the following, but I get an error when predicting:

import numpy as np

# A function to build the 3D data, as far as I understood what was needed:
def preprocess_data(stock, seq_len):
    amount_of_features = len(stock.columns)
    data = stock.values
    sequence_length = seq_len  # + 1
    result = []
    for index in range(len(data) - sequence_length):
        result.append(data[index : index + sequence_length])
    X_train = np.array(result)
    X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], amount_of_features))
    return X_train

# creating the train as:
# X == the DF of 1500 rows and 2000 features
window = 10
train_X = preprocess_data(X[::-1], window)

Answer 1

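After a while I managed to properly understand where the dimensions go. Keras expects a 3D array of .shape (totalRows, sequences, totalColumns). The sequences dimension was the one that confused me most.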

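That is because reshaping the df with df.values.reshape(len(df), 1, len(df.columns)) means Keras learns from 1-row matrices, and this gave me poor results. I also did not know that it is best to scale the dataset first. For me, MinMaxScaler with feature_range=(-1, 1) worked best, but (0, 1) could work too.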
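To make the scaling step concrete, here is a minimal sketch of min-max scaling to the range (-1, 1), written in plain NumPy so the arithmetic is visible. The helper names fit_minmax and apply_minmax, the toy array sizes, and the choice to learn the minimum and range from the training rows only and reuse them on the forecast rows are assumptions for illustration, not something taken from the answer. scikit-learn's MinMaxScaler(feature_range=(-1, 1)) performs the same kind of per-column rescaling.

```python
import numpy as np

def fit_minmax(train, low=-1.0, high=1.0):
    """Learn per-column minimum and range from the training rows only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid dividing by zero on constant columns
    return col_min, col_range, low, high

def apply_minmax(data, params):
    """Rescale each column of `data` with previously learned parameters."""
    col_min, col_range, low, high = params
    return (data - col_min) / col_range * (high - low) + low

rng = np.random.default_rng(0)
train = rng.normal(size=(20, 4))    # toy stand-in for the (1500, 2000) training frame
forecast = rng.normal(size=(3, 4))  # toy stand-in for the (10, 2000) forecast frame

params = fit_minmax(train)
scaled_train = apply_minmax(train, params)
scaled_forecast = apply_minmax(forecast, params)

# The single-row reshape described above: (rows, 1, columns)
one_step = scaled_train.reshape(len(scaled_train), 1, scaled_train.shape[1])
print(one_step.shape)  # (20, 1, 4)
```

Forecast values that fall outside the training range will land slightly outside (-1, 1) with this approach, which is usually acceptable for inputs.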

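What made it click for me was, first of all, using sequences of more than 1 row (or more than 1 day, since my dataset is a time series). That means that instead of feeding 1 row of features X resulting in 1 value of y, I fed something like 5 rows of features X resulting in 1 value of y. As in: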

import numpy as np

# after scaling the df, resulting in "scaled_dataset"
sequences = 5
result = []
# walk over each of the 1500 rows
for i in range(0, len(scaled_dataset)):
    # every group must have the same length, so windows that would run
    # past the end of the df are skipped
    if i + sequences <= len(scaled_dataset):
        # adds to the list as [[R1a,R1b...R1t],[R2a,R2b...R2t],...[R5a,R5b...R5t]]
        result.append(scaled_dataset[i:i+sequences].values)
# convert to an array; keras handles float32 better than float64
train_x = np.array(result).astype('float32')
# trim y to the same length as X
train_y = np.array(y.tail(train_x.shape[0]).values)

train_x.shape, train_y.shape

>>> (1495, 5, 2400), (1495,)

Put another way, this is how to think about the shape in Keras for my problem:

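Since this is a time series, the above means that 5 days of data (rows 0 to 4) produce the y value of row 4, the last row of the window (counting from zero, as paired by y.tail above).

Then, dropping the first day and adding the day after the last one, still 5 days (rows 1 to 5), the data produce the y value of row 5.

Then, dropping the second day and adding the next day after the last one, still 5 days (rows 2 to 6), the data produce the y value of row 6.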
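The sliding-window pairing can be checked on a tiny example. The sketch below uses a hypothetical helper, make_windows, and toy data in which every feature value equals its row number, so it is easy to see which rows end up in each window. Counting rows from zero, each window is paired with the y value of its last row, which is the pairing that y.tail(train_x.shape[0]) produces in the code above. The last part, which takes the most recent rows as a single sample for prediction, is one plausible way to shape the forecast input and is an assumption rather than something the answer states.

```python
import numpy as np

def make_windows(features, targets, seq_len):
    """Stack overlapping windows of `seq_len` rows and pair each window
    with the target of its last row."""
    n_windows = len(features) - seq_len + 1
    x = np.stack([features[i:i + seq_len] for i in range(n_windows)])
    y = targets[seq_len - 1:]
    return x.astype("float32"), y.astype("float32")

# Toy data: 8 "days", 3 features; each feature value is the row number
features = np.arange(8, dtype="float32").reshape(8, 1).repeat(3, axis=1)
targets = np.arange(8, dtype="float32") * 10  # y of row k is 10 * k

x, y = make_windows(features, targets, seq_len=5)
print(x.shape, y.shape)   # (4, 5, 3) (4,)
print(x[0, :, 0], y[0])   # rows 0..4 paired with the y of row 4
print(x[1, :, 0], y[1])   # rows 1..5 paired with the y of row 5

# Forecast input: the most recent seq_len rows as one 3D sample
latest = features[-5:].reshape(1, 5, 3)
print(latest.shape)       # (1, 5, 3)
```

With real data, features would be scaled_dataset.values and targets would be y.values. Applied to a forecast frame of 10 rows with seq_len=5, the same windowing would yield 6 windows, so the number of predictions depends on the window length chosen for training.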

This is quite confusing for beginners with Keras/LSTM, but I hope I have spelled it out well enough for anyone who lands here.
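Since the question mentions a MemoryError, it may help to estimate how much memory the windowed array needs before building it. The sketch below is a rough calculation using the rounded sizes from the question (1500 rows, 2000 features) and a window of 5. The helper window_bytes is hypothetical, and the figures cover only the final dense NumPy array, not intermediate Python lists, which take considerably more.

```python
import numpy as np

def window_bytes(n_rows, n_features, seq_len, dtype="float32"):
    """Bytes needed for a dense (n_windows, seq_len, n_features) array."""
    n_windows = n_rows - seq_len + 1
    return n_windows * seq_len * n_features * np.dtype(dtype).itemsize

for dtype in ("float32", "float64"):
    size = window_bytes(1500, 2000, seq_len=5, dtype=dtype)
    print(dtype, round(size / 1e6, 1), "MB")
# float32 59.8 MB
# float64 119.7 MB
```

Windowing multiplies the data volume by roughly the window length, so a longer window or more rows scales these numbers accordingly, and casting to float32 halves the footprint compared with float64.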
