How do I forecast data (in my case, rainfall) into the future after I have trained a model using scikit_learn and pandas?

Question

I am training a model to predict future rainfall data, and I have finished training it. I am using this dataset: https://www.kaggle.com/redikod/historical-rainfall-data-in-bangladesh which looks like this:

            StationIndex Station  Year  Month  Day  Rainfall  dayofyear
1970-01-01             1   Dhaka  1970      1    1         0          1
1970-01-02             1   Dhaka  1970      1    2         0          2
1970-01-03             1   Dhaka  1970      1    3         0          3
1970-01-04             1   Dhaka  1970      1    4         0          4
1970-01-05             1   Dhaka  1970      1    5         0          5

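Following a piece of code I found online, I fitted the model on the train data and ran it on the test data. I then also checked the predicted values against the actual values.

Here is the code: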

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

#data is in local folder
df = pd.read_csv("data.csv")
df.head(5)

df.drop(df[(df['Day']>28) & (df['Month']==2) & (df['Year']%4!=0)].index,inplace=True)
df.drop(df[(df['Day']>29) & (df['Month']==2) & (df['Year']%4==0)].index,inplace=True)
df.drop(df[(df['Day']>30) & ((df['Month']==4)|(df['Month']==6)|(df['Month']==9)|(df['Month']==11))].index,inplace=True)

date = [str(y)+'-'+str(m)+'-'+str(d) for y, m, d in zip(df.Year, df.Month, df.Day)]
df.index = pd.to_datetime(date)
df['date'] = df.index
df['dayofyear']=df['date'].dt.dayofyear
df.drop('date',axis=1,inplace=True)

df.head()
df.shape  # size/shape are attributes, not methods; df.size() raises TypeError
df.info()

df.plot(x='Year',y='Rainfall',style='.', figsize=(15,5))

train = df.loc[df['Year'] <= 2015]
test = df.loc[df['Year'] == 2016]
train=train[train['Station']=='Dhaka']
test=test[test['Station']=='Dhaka']

X_train=train.drop(['Station','StationIndex','dayofyear'],axis=1)
Y_train=train['Rainfall']
X_test=test.drop(['Station','StationIndex','dayofyear'],axis=1)
Y_test=test['Rainfall']

from sklearn import svm
from sklearn.svm import SVC
model = svm.SVC(gamma='auto',kernel='linear')
model.fit(X_train, Y_train)

Y_pred = model.predict(X_test)

df1 = pd.DataFrame({'Actual Rainfall': Y_test, 'Predicted Rainfall': Y_pred})  
df1[df1['Predicted Rainfall']!=0].head(10)

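After this, I tried to actually use the model to predict rainfall for days/months/years into the future. I tried a few approaches, including some written for stock prices (after adapting the code), but none of them seem to work. Since I have already trained the model, I thought predicting future days would be easy. Say I train on the 1970-2015 data and test on the 2016 data. Now I want to predict the rainfall for 2017. Something like that.

My question is, how can I do this in an intuitive way?

I would be really grateful if someone could answer this.

Edit @Mercury: This is the actual result after using that code. I doubt the model is working at all... Here is an image of the actual result: https://i.stack.imgur.com/81Vk1.png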

Answer 1

I noticed a very simple mistake here:

X_train=train.drop(['Station','StationIndex','dayofyear'],axis=1)
Y_train=train['Rainfall']
X_test=test.drop(['Station','StationIndex','dayofyear'],axis=1)
Y_test=test['Rainfall']

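You have not dropped the Rainfall column from your training data.

I'll make a bold guess and say you are getting a perfect 100% accuracy on both training and testing, right? This is why. Your model sees that whatever appears in the 'Rainfall' column of the input is always the answer, so it does exactly the same thing during testing and gets perfect results, but in fact it is not predicting anything at all!

Try running it like this: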

X_train=train.drop(['Station','StationIndex','dayofyear','Rainfall'],axis=1)
Y_train=train['Rainfall']
X_test=test.drop(['Station','StationIndex','dayofyear','Rainfall'],axis=1)
Y_test=test['Rainfall']

from sklearn import svm
model = svm.SVC(gamma='auto',kernel='linear')
model.fit(X_train, Y_train)
print('Accuracy on training set: {:.2f}%'.format(100*model.score(X_train, Y_train)))
print('Accuracy on testing set: {:.2f}%'.format(100*model.score(X_test, Y_test)))
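Once the target is out of the features, forecasting 2017 comes down to building the same feature columns for the future dates and calling predict on them. Below is a minimal sketch of that idea, not a tuned model. It makes several assumptions that are not in the question: the data is synthetic (a made-up seasonal signal standing in for the Dhaka rows of data.csv), the features are just Year and dayofyear, and a DecisionTreeRegressor is swapped in for the SVC because rainfall is a continuous quantity.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def make_features(dates):
    """Build calendar features that are known in advance for any date."""
    return pd.DataFrame(
        {"Year": dates.year.to_numpy(), "dayofyear": dates.dayofyear.to_numpy()},
        index=dates,
    )


# Synthetic stand-in for the historical Dhaka rows (assumption, not the real data).
rng = np.random.default_rng(0)
past_dates = pd.date_range("1970-01-01", "2016-12-31", freq="D")
history = make_features(past_dates)
# Made-up seasonal signal peaking mid-year, plus noise, clipped at zero.
seasonal = 20 * np.sin(np.pi * history["dayofyear"] / 366) ** 4
history["Rainfall"] = np.clip(seasonal + rng.normal(0, 3, len(history)), 0, None)

features = ["Year", "dayofyear"]  # note: Rainfall is NOT among the inputs
model = DecisionTreeRegressor(max_depth=4, random_state=0)
model.fit(history[features], history["Rainfall"])

# Forecast: create rows for 2017 using only columns available ahead of time.
future_dates = pd.date_range("2017-01-01", "2017-12-31", freq="D")
future = make_features(future_dates)
future["Predicted Rainfall"] = model.predict(future[features])
print(future.head())
```

With only calendar features, a model like this can at best reproduce the average seasonal pattern. A tree also cannot extrapolate a trend past the last training year, so the 2017 values here are effectively the seasonal averages the tree learned.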

Answer 2

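The data is simple. If you are in a Kaggle competition, interpretability is not much of a concern, accuracy is, and you can use any complex model and get good results. But if I wanted interpretability, I would use a decision tree with a depth of no more than 4. Reduce the depth and you will see a more general decision tree. It will give you good insight into the data.

Some suggestions:

Remove the Day and Month columns altogether; that information is already stored in the dayofyear attribute (leap years won't be much of a problem).
You are left with only three columns: Year, Station and dayofyear.
See whether the Year column is important (a decision tree's important decisions occur in the first 2-3 levels); if it is not, you can drop it. In the real world, changes are harder to predict, and the more general the model, the better. Station and dayofyear are important considerations and should not be ignored.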
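The suggestions above can be sketched roughly as follows. This is an illustration under assumptions rather than a result on the real dataset: the data is synthetic, "StationB" is a hypothetical second station, Station is one-hot encoded with pd.get_dummies, and a DecisionTreeRegressor is used because rainfall is continuous.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data for two stations (assumption; "StationB" is hypothetical).
rng = np.random.default_rng(1)
dates = pd.date_range("2000-01-01", "2009-12-31", freq="D")
doy = dates.dayofyear.to_numpy()
frames = []
for station, scale in [("Dhaka", 20.0), ("StationB", 35.0)]:
    rain = scale * np.sin(np.pi * doy / 366) ** 4 + rng.normal(0, 2, len(dates))
    frames.append(pd.DataFrame({
        "Station": station,
        "Year": dates.year.to_numpy(),
        "dayofyear": doy,
        "Rainfall": np.clip(rain, 0, None),
    }))
df = pd.concat(frames, ignore_index=True)

# Keep only Year, Station and dayofyear as inputs; encode Station numerically.
X = pd.get_dummies(df[["Station", "Year", "dayofyear"]], columns=["Station"], dtype=int)
y = df["Rainfall"]

# A shallow tree stays readable and tends to generalize better than a deep one.
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# Which columns drive the splits? A near-zero value for Year suggests dropping it.
importances = pd.Series(tree.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# Print the top levels of the tree, where the most important decisions sit.
print(export_text(tree, feature_names=list(X.columns), max_depth=2))
```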

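Then check more complex models: do they improve your accuracy? They might.

If they do, use them; otherwise stick with the simpler models, since they are more interpretable and faster to compute.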
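One way to run that comparison is to fit each candidate on the same training years and score it on the held-out year. The sketch below assumes synthetic data, a RandomForestRegressor as the example "complex model", and mean absolute error as the metric; none of these choices come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic single-station data (assumption, standing in for the Dhaka rows).
rng = np.random.default_rng(2)
dates = pd.date_range("1990-01-01", "2016-12-31", freq="D")
doy = dates.dayofyear.to_numpy()
rain = 20 * np.sin(np.pi * doy / 366) ** 4 + rng.normal(0, 3, len(dates))
df = pd.DataFrame({
    "Year": dates.year.to_numpy(),
    "dayofyear": doy,
    "Rainfall": np.clip(rain, 0, None),
})

features = ["Year", "dayofyear"]
train = df[df["Year"] <= 2015]   # same split idea as in the question
test = df[df["Year"] == 2016]

candidates = {
    "shallow tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "random forest": RandomForestRegressor(
        n_estimators=50, min_samples_leaf=20, random_state=0
    ),
}

# Fit every candidate on the same data and compare held-out error.
scores = {}
for name, estimator in candidates.items():
    estimator.fit(train[features], train["Rainfall"])
    predictions = estimator.predict(test[features])
    scores[name] = mean_absolute_error(test["Rainfall"], predictions)
    print(f"{name}: MAE on 2016 = {scores[name]:.2f}")
```

If the more complex model does not clearly beat the shallow tree on the held-out year, the simpler one is the easier choice to explain and maintain.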

python

machine-learning

forecasting

scikit-learn

sklearn-pandas