Index and Date Problem on Plot Prediction
I have a dataframe:
import yfinance as yf
df = yf.download('AAPL',
start='2001-01-01',
end='2005-12-31',
progress=False)
Then I split it into training and test sets at an 80:20 ratio. Here is some code to check the indexes of my train and test sets.
train_df.index
The output is
test_df.index
The output is
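After fitting the model on the training data, I predict on the 252 test observations, and the result is
How can I change the prediction output into a dataframe with a datetime (%Y%m%d) index instead of an integer index? I have read many articles and answers here on Stack Overflow, but I have not found a solution yet.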
One thing you can do is simply save the datetime index before model training/inference, and then join it back on the RangeIndex.
That is:
time_index = df.reset_index()[['utc']] #replace utc with your index name
df = df.reset_index()
Train the model, then join the saved index on the RangeIndex. Then set the index back to the DatetimeIndex.
prediction = prediction.join(time_index)
prediction.set_index('utc', inplace=True)
Working example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':np.arange(10)}, index=pd.date_range('2021-01-01', '2021-01-10'))
df.index.name = 'Date'
#Save the time_index but indexed by RangeIndex to allow for join after prediction
time_index = df.reset_index()[['Date']]
#Some arbitrary prediction dataframe with a RangeIndex
prediction = pd.DataFrame({'predictions':np.arange(0,10)})
#joins prediction and time_index on the RangeIndex
prediction = prediction.join(time_index)
#Sets index to the time_index
prediction.set_index('Date', inplace=True)
You will now have a dataframe that looks like this:
predictions
Date
2021-01-01 0
2021-01-02 1
2021-01-03 2
2021-01-04 3
2021-01-05 4
2021-01-06 5
2021-01-07 6
2021-01-08 7
2021-01-09 8
2021-01-10 9
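If the test set was produced by a plain chronological split and the model returns one value per test row in the same order, the join may not be needed at all: the test set's own DatetimeIndex can probably be passed straight to the prediction dataframe. The following is a minimal sketch under those assumptions, using synthetic data; `raw_predictions` is a hypothetical stand-in for whatever your model returns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the price data (hypothetical values)
dates = pd.date_range("2021-01-01", periods=10, freq="D", name="Date")
df = pd.DataFrame({"col1": np.arange(10.0)}, index=dates)

# Chronological 80:20 split, no shuffling, so both parts keep their dates
split = int(len(df) * 0.8)
train_df, test_df = df.iloc[:split], df.iloc[split:]

# Hypothetical model output: one value per test row, in test order
raw_predictions = np.arange(len(test_df), dtype=float)

# Reuse the DatetimeIndex of the test set instead of the default RangeIndex
prediction = pd.DataFrame({"predictions": raw_predictions}, index=test_df.index)
print(prediction)
```

This only works when the length and order of the predictions match the test rows; if the model drops rows (for example, a warm-up window), the index has to be sliced to match.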
To drive the point home, here is a concrete example using your data source:
import yfinance as yf
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = yf.download('AAPL',
start='2001-01-01',
end='2005-12-31',
progress=False)
#Save the time_index but indexed by RangeIndex to allow for join after prediction
time_index = df.reset_index()[['Date']]
df = df.reset_index()
#Assuming we predict Volume
y = df[['Volume']]
X = df.drop(columns=['Volume', 'Date'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)
#Predict values, transpose to fit into dataframe
predicted_values = model.predict(X_test).T[0]
#Create prediction dataframe
prediction = pd.DataFrame({'y-pred':predicted_values}, index=X_test.index)
#join test or true data to prediction for comparison
prediction = prediction.join(y_test)
#joins prediction and time_index on the RangeIndex
prediction = prediction.join(time_index)
#Sets index to the time_index
prediction.set_index('Date', inplace=True)
This results in:
y-pred Volume
Date
2001-07-26 3.893012e+08 369140800
2004-12-20 1.191681e+09 1168126400
2005-02-17 8.905975e+08 1518473600
2002-12-03 2.004725e+08 227869600
2005-10-10 8.430103e+08 50750560
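The dates above are out of order because `train_test_split` shuffles rows by default. If you want the result in chronological order for plotting, and the index shown as `%Y%m%d` strings as the question asks, something like the sketch below may help. The small dataframe is made-up illustration data, not real model output.

```python
import pandas as pd

# Made-up prediction frame with an unordered DatetimeIndex
prediction = pd.DataFrame(
    {"y-pred": [3.0, 1.0, 2.0]},
    index=pd.DatetimeIndex(["2005-02-17", "2001-07-26", "2004-12-20"], name="Date"),
)

# Put the rows back in chronological order
prediction = prediction.sort_index()

# Optional: a copy whose index holds %Y%m%d strings for display or export
formatted = prediction.copy()
formatted.index = prediction.index.strftime("%Y%m%d")
print(formatted)
```

Keeping the DatetimeIndex on `prediction` and formatting only a copy is a deliberate choice here: string indexes lose date-aware slicing and resampling, so the formatted version is best used just for output.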