Shaping pandas dataframe for LSTM input
I have a simple dataset with only two columns (year and oil price). Now I need to reshape them so that Keras's LSTM layer accepts them as input_shape.
My code looks like this, and I basically need help with the yellow-marked area. I think I need to alter/convert X_train and X_test beforehand (arrays, normalization, etc.), but I only get errors when I try...
I think your code would work if you keep X_train and X_test as two-dimensional dataframes. So your problem should be solved if you define
X_train = train[["Year"]]
X_test = test[["Year"]]
After that you can define your LSTM architecture just as you did in the question.
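The difference matters because single-bracket selection yields a 1-D Series while double-bracket selection yields a 2-D DataFrame; a quick sketch on a hypothetical miniature train frame:

```python
import pandas as pd

# hypothetical tiny train frame, just to show the shape difference
train = pd.DataFrame({"Year": [1861, 1862, 1863],
                      "Oil Crude Price ($)": [0.49, 1.05, 3.15]})

series_1d = train["Year"]    # single brackets -> 1-D Series
frame_2d = train[["Year"]]   # double brackets -> 2-D DataFrame

print(series_1d.shape)  # (3,)
print(frame_2d.shape)   # (3, 1)
```

Keras layers interpret the trailing dimension of a 2-D input as the feature axis, which is why the double-bracket form slots in without extra reshaping.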
One solution is to reshape X_train like X_train.reshape((X_train.shape[0], 1, 1)) and omit the input_shape argument in the first LSTM layer. The input shape of an LSTM layer is always (batch_size, timesteps, features). Read more about it here.
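A minimal sketch of that reshape, using a hypothetical 1-D array in place of X_train (one feature and one timestep per sample):

```python
import numpy as np

# stand-in for X_train: one scaled value per year
X_train = np.arange(5, dtype=float)

# LSTM layers expect (batch_size, timesteps, features);
# with a single timestep and a single feature per sample:
X_train_lstm = X_train.reshape((X_train.shape[0], 1, 1))
print(X_train_lstm.shape)  # (5, 1, 1)
```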
Another thing to consider is scaling the data into a "small" range such as [0, 1], so that training converges smoothly; we don't want the weights to go wild during updates, though it also depends on the specific implementation/application.
You may need to tune the knobs (hyperparameters such as activation functions, dropouts, units, batch size, etc.) to get better predictive performance.
Here is a complete working example with comments:
# imports
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout
# data setup
years = [i for i in range(1861, 2021)]
oil = [0.49, 1.05, 3.15, 8.06, 6.59, 3.74, 2.41, 3.63, 3.64, 3.86, 4.34, 3.64, 1.83, 1.17, 1.35, 2.56, 2.42, 1.19, 0.86, 0.95, 0.86, 0.78, 1, 0.84, 0.88, 0.71, 0.67, 0.88, 0.94, 0.87, 0.67, 0.56, 0.64, 0.84, 1.36, 1.18, 0.79, 0.91, 1.29, 1.19, 0.96, 0.8, 0.94, 0.86, 0.62, 0.73, 0.72, 0.72, 0.7, 0.61, 0.61, 0.74, 0.95, 0.81, 0.64, 1.1, 1.56, 1.98, 2.01, 3.07, 1.73, 1.61, 1.34, 1.43, 1.68, 1.88, 1.3, 1.17, 1.27, 1.19, 0.65, 0.87, 0.67, 1, 0.97, 1.09, 1.18, 1.13, 1.02, 1.02, 1.14, 1.19, 1.2, 1.21, 1.05, 1.12, 1.9, 1.99, 1.78, 1.71, 1.71, 1.71, 1.93, 1.93, 1.93, 1.93, 1.9, 2.08, 2.08, 1.9, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 2.24, 2.48, 3.29, 11.58, 11.53, 12.8, 13.92, 14.02, 31.61, 36.83, 35.93, 32.97, 29.55, 28.78, 27.56, 14.43, 18.43503937, 14.9238417, 18.22611328, 23.72582031, 20.0009144, 19.32083658, 16.97163424, 15.81762646, 17.01667969, 20.66848837, 19.09258755, 12.71566148, 17.97007782, 28.49544922, 24.44389105, 25.02325581, 28.83070313, 38.265, 54.52108949, 65.1440625, 72.38907843, 97.25597276, 61.67126482, 79.4955336, 111.2555976, 111.6697024, 108.6585178, 98.94600791, 52.38675889, 43.73416996, 54.19244048, 71.31005976, 64.21057312, 41.83834646]
data = pd.DataFrame(np.vstack((years, oil)).T, columns = ["Year", "Oil Crude Price ($)"]).astype({'Year': int})
# train percentage, thus test percentage = 1 - train_split
train_split = 0.8
# scaler for inputs and outputs
scaler = MinMaxScaler()
# scaling data between 0 and 1
data_scaled = scaler.fit_transform(data.values)
# splitting data into train set, 0.8 * 160 = first 128 rows
X_train = data_scaled[:int(train_split * len(data_scaled)),0]
y_train = data_scaled[:int(train_split * len(data_scaled)),1]
# splitting data into test set, 0.2 * 160 = last 32 rows
X_test = data_scaled[int(train_split * len(data_scaled)):,0]
y_test = data_scaled[int(train_split * len(data_scaled)):,1]
# sanity check, adding rows in X_train and X_test MUST add to total rows in data
assert len(X_train) + len(X_test) == len(data)
# reshaping inputs for LSTM
X_train_lstm = X_train.reshape((X_train.shape[0], 1, 1))
X_test_lstm = X_test.reshape((X_test.shape[0], 1, 1))
# building model with several LSTM, dropouts, and dense layers
model = Sequential()
model.add(LSTM(units = 512, return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(units = 128, return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(units = 64, return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(units = 32, return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(units = 16))
model.add(Dropout(0.2))
model.add(Dense(units = 1))
# compiling model with the adam optimizer and mse loss
model.compile(optimizer="adam", loss="mse")
# training model for 100 epochs with 20 samples per batch
history = model.fit(X_train_lstm, y_train, epochs=100, batch_size = 20, verbose=1)
# making predictions using test set
y_pred_scaled = model.predict(X_test_lstm)
def original_scale(scaler, x, y):
    return scaler.inverse_transform(np.concatenate((x.reshape((x.shape[0], 1)), y), axis=1))
# transforming values back to original scale
y_pred = original_scale(scaler, X_test, y_pred_scaled)[:,1] # predicted price
y_test = data.values[int(train_split * data_scaled.shape[0]):,1] # actual price
y_test_years = data.values[int(train_split * data_scaled.shape[0]):,0]
# wrapping up putting results together in a dataframe
output = pd.DataFrame(data = np.vstack((y_test_years, y_test, y_pred)).T, columns = ["Year", "Oil Crude Price ($)", "Predicted Oil Crude Price ($)"]).astype({'Year': int})
print(output)
Output:
Year Oil Crude Price ($) Predicted Oil Crude Price ($)
0 1989 18.226113 22.428251
1 1990 23.725820 23.613170
2 1991 20.000914 24.847811
3 1992 19.320837 26.132980
4 1993 16.971634 27.469316
5 1994 15.817626 28.857338
6 1995 17.016680 30.297413
7 1996 20.668488 31.789748
8 1997 19.092588 33.334372
9 1998 12.715661 34.931136
10 1999 17.970078 36.579701
11 2000 28.495449 38.279516
12 2001 24.443891 40.029859
13 2002 25.023256 41.829785
14 2003 28.830703 43.678103
15 2004 38.265000 45.573485
16 2005 54.521089 47.514343
17 2006 65.144063 49.498909
18 2007 72.389078 51.525253
19 2008 97.255973 53.591203
20 2009 61.671265 55.694481
21 2010 79.495534 57.832594
22 2011 111.255598 60.002952
23 2012 111.669702 62.202827
24 2013 108.658518 64.429348
25 2014 98.946008 66.679673
26 2015 52.386759 68.950699
27 2016 43.734170 71.239459
28 2017 54.192440 73.542838
29 2018 71.310060 75.857814
30 2019 64.210573 78.181272
31 2020 41.838346 80.510184
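To quantify how far the predictions drift from the actual prices, one option is a root-mean-square error over the two columns; the values below are hypothetical stand-ins for slices of y_test and y_pred:

```python
import numpy as np

# hypothetical actual vs. predicted prices, shaped like y_test / y_pred above
y_test = np.array([18.23, 23.73, 20.00])
y_pred = np.array([22.43, 23.61, 24.85])

# root-mean-square error, in the original price units ($ per barrel)
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
print(round(rmse, 2))
```

RMSE is in the same units as the target, which makes it easy to read off how many dollars the model is off by on average.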
Package versions:
keras 2.4.3
numpy 1.19.2
pandas 1.1.5
scikit-learn 0.23.2
tensorflow 2.4.1