Shaping pandas dataframe for LSTM input

I have a simple dataset with only two columns (Year and oil price). Now I need to reshape it so that Keras's LSTM layer will accept it as its input_shape.

My code looks like this, and I basically need help with the yellow-highlighted area. I think I need to alter/convert X_train and X_test beforehand (to arrays, normalize, etc.), but I just get errors whenever I try...

I think your code would work if you kept X_train and X_test as two-dimensional dataframes. So your problem would be solved if you defined

X_train = train[["Year"]]
X_test = test[["Year"]]

Afterwards, you can define your LSTM architecture as you did in the question.

One solution is to reshape X_train with X_train.reshape((X_train.shape[0], 1, 1)) and omit the input_shape argument in the first LSTM layer. The input shape of an LSTM layer is always (batch_size, timesteps, features). See more about this here.
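
For illustration, a minimal sketch of that reshape (using a dummy 1D array in place of the real X_train):

import numpy as np

# dummy 1D feature array with 5 samples (stand-in for X_train)
X_train = np.arange(5, dtype=float)
print(X_train.shape)        # (5,)

# reshape to (batch_size, timesteps, features) = (5, 1, 1)
X_train_lstm = X_train.reshape((X_train.shape[0], 1, 1))
print(X_train_lstm.shape)   # (5, 1, 1)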

Another thing to consider is scaling the data into a "small" range such as [0, 1], so that training converges smoothly, since we don't want the weights to blow up during updates; that said, it also depends on the specific implementation/application.
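
As a minimal sketch (dummy numbers, using scikit-learn's MinMaxScaler, which maps each column to [0, 1] by default):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# dummy two-column array (e.g. year and price)
raw = np.array([[1990.0, 20.0],
                [2000.0, 30.0],
                [2010.0, 80.0]])

scaler = MinMaxScaler()             # default feature_range=(0, 1)
scaled = scaler.fit_transform(raw)  # each column now lies in [0, 1]

# later, model outputs can be mapped back to the original scale
restored = scaler.inverse_transform(scaled)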

You may also need to tune the knobs (hyperparameters such as activation functions, dropouts, units, batch size, etc.) to get better predictive performance.

Here is a complete working example with comments:

# imports
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout
from keras.activations import relu

# data setup
years = [i for i in range(1861, 2021)]
oil = [0.49, 1.05, 3.15, 8.06, 6.59, 3.74, 2.41, 3.63, 3.64, 3.86, 4.34, 3.64, 1.83, 1.17, 1.35, 2.56, 2.42, 1.19, 0.86, 0.95, 0.86, 0.78, 1, 0.84, 0.88, 0.71, 0.67, 0.88, 0.94, 0.87, 0.67, 0.56, 0.64, 0.84, 1.36, 1.18, 0.79, 0.91, 1.29, 1.19, 0.96, 0.8, 0.94, 0.86, 0.62, 0.73, 0.72, 0.72, 0.7, 0.61, 0.61, 0.74, 0.95, 0.81, 0.64, 1.1, 1.56, 1.98, 2.01, 3.07, 1.73, 1.61, 1.34, 1.43, 1.68, 1.88, 1.3, 1.17, 1.27, 1.19, 0.65, 0.87, 0.67, 1, 0.97, 1.09, 1.18, 1.13, 1.02, 1.02, 1.14, 1.19, 1.2, 1.21, 1.05, 1.12, 1.9, 1.99, 1.78, 1.71, 1.71, 1.71, 1.93, 1.93, 1.93, 1.93, 1.9, 2.08, 2.08, 1.9, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 2.24, 2.48, 3.29, 11.58, 11.53, 12.8, 13.92, 14.02, 31.61, 36.83, 35.93, 32.97, 29.55, 28.78, 27.56, 14.43, 18.43503937, 14.9238417, 18.22611328, 23.72582031, 20.0009144, 19.32083658, 16.97163424, 15.81762646, 17.01667969, 20.66848837, 19.09258755, 12.71566148, 17.97007782, 28.49544922, 24.44389105, 25.02325581, 28.83070313, 38.265, 54.52108949, 65.1440625, 72.38907843, 97.25597276, 61.67126482, 79.4955336, 111.2555976, 111.6697024, 108.6585178, 98.94600791, 52.38675889, 43.73416996, 54.19244048, 71.31005976, 64.21057312, 41.83834646]

data = pd.DataFrame(np.vstack((years, oil)).T, columns = ["Year", "Oil Crude Price ($)"]).astype({'Year': int})

# train percentage, thus test percentage = 1 - train_split
train_split = 0.8

# scaler for inputs and outputs
scaler = MinMaxScaler()

# scaling data between 0 and 1
data_scaled = scaler.fit_transform(data.values)

# splitting data into train set: 0.8 * 160 = first 128 rows
X_train = data_scaled[:int(train_split * len(data_scaled)),0]
y_train = data_scaled[:int(train_split * len(data_scaled)),1]

# splitting data into test set: 0.2 * 160 = last 32 rows
X_test = data_scaled[int(train_split * len(data_scaled)):,0]
y_test = data_scaled[int(train_split * len(data_scaled)):,1]

# sanity check, adding rows in X_train and X_test MUST add to total rows in data
assert len(X_train) + len(X_test) == len(data)

# reshaping inputs for LSTM
X_train_lstm = X_train.reshape((X_train.shape[0], 1, 1))
X_test_lstm = X_test.reshape((X_test.shape[0], 1, 1))

# building model with several LSTM, dropouts, and dense layers
model = Sequential()
model.add(LSTM(units = 512, return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(units = 128, return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(units = 64, return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(units = 32, return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(units = 16))
model.add(Dropout(0.2))
model.add(Dense(units = 1))

# compiling model with the adam optimizer and mse loss
model.compile(optimizer="adam", loss="mse")

# training model for 100 epochs with 20 samples per batch
history = model.fit(X_train_lstm, y_train, epochs=100, batch_size = 20, verbose=1)

# making predictions using test set
y_pred_scaled = model.predict(X_test_lstm)

# rebuild the two-column layout the scaler was fitted on (scaled inputs next to
# scaled predictions), then invert the scaling back to the original units
def original_scale(scaler, x, y):
    return scaler.inverse_transform(np.concatenate((x.reshape((x.shape[0], 1)), y), axis=1))

# transforming values back to original scale 
y_pred = original_scale(scaler, X_test, y_pred_scaled)[:,1] # predicted price
y_test = data.values[int(train_split * data_scaled.shape[0]):,1] # actual price
y_test_years = data.values[int(train_split * data_scaled.shape[0]):,0]

# wrapping up putting results together in a dataframe
output = pd.DataFrame(data = np.vstack((y_test_years, y_test, y_pred)).T, columns = ["Year", "Oil Crude Price ($)", "Predicted Oil Crude Price ($)"]).astype({'Year': int})

print(output)

Output:

    Year  Oil Crude Price ($)  Predicted Oil Crude Price ($)
0   1989            18.226113                      22.428251
1   1990            23.725820                      23.613170
2   1991            20.000914                      24.847811
3   1992            19.320837                      26.132980
4   1993            16.971634                      27.469316
5   1994            15.817626                      28.857338
6   1995            17.016680                      30.297413
7   1996            20.668488                      31.789748
8   1997            19.092588                      33.334372
9   1998            12.715661                      34.931136
10  1999            17.970078                      36.579701
11  2000            28.495449                      38.279516
12  2001            24.443891                      40.029859
13  2002            25.023256                      41.829785
14  2003            28.830703                      43.678103
15  2004            38.265000                      45.573485
16  2005            54.521089                      47.514343
17  2006            65.144063                      49.498909
18  2007            72.389078                      51.525253
19  2008            97.255973                      53.591203
20  2009            61.671265                      55.694481
21  2010            79.495534                      57.832594
22  2011           111.255598                      60.002952
23  2012           111.669702                      62.202827
24  2013           108.658518                      64.429348
25  2014            98.946008                      66.679673
26  2015            52.386759                      68.950699
27  2016            43.734170                      71.239459
28  2017            54.192440                      73.542838
29  2018            71.310060                      75.857814
30  2019            64.210573                      78.181272
31  2020            41.838346                      80.510184

Package versions:

keras                     2.4.3
numpy                     1.19.2
pandas                    1.1.5
scikit-learn              0.23.2
tensorflow                2.4.1