SciKit-learn for data driven regression of oscillating data

Question

Long time lurker, first time poster.

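My data roughly follows y=sin(time), but it also depends on variables other than time. Because the target variable y oscillates, its statistical correlation with time is almost zero, even though y clearly depends very strongly on time.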
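The near-zero correlation is easy to check on synthetic data. The following is a minimal sketch, assuming ten years of daily samples and a one-year period (both are assumptions made for illustration, not values taken from the actual data set):

```python
import numpy as np

# Assumed synthetic data: ten years of daily samples with a one-year period.
days = np.arange(3653)
y = np.sin(2 * np.pi * days / 365.25)

# Pearson correlation between the raw day count and the oscillating target.
r_raw = np.corrcoef(days, y)[0, 1]

# Correlation between a periodic transform of the same day count and the target.
r_periodic = np.corrcoef(np.sin(2 * np.pi * days / 365.25), y)[0, 1]

print(f"raw time: {r_raw:.3f}, periodic feature: {r_periodic:.3f}")
```

The raw day count is almost uncorrelated with the target, while a periodic transform of the very same variable is perfectly correlated with it.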

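The goal is to predict future values of the target variable. I want to avoid explicit model assumptions and rely instead on data-driven models and machine learning, so I have tried the regression methods in sklearn.

I have tried the following methods (the parameters were blindly copied from examples and other threads):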

LogisticRegression()
QDA()
GridSearchCV(SVR(kernel='rbf', gamma=0.1), cv=5,
                   param_grid={"C": [1e0, 1e1, 1e2, 1e3],
                               "gamma": np.logspace(-2, 2, 5)})
GridSearchCV(KernelRidge(kernel='rbf', gamma=0.1), cv=5,
                  param_grid={"alpha": [1e0, 0.1, 1e-2, 1e-3],
                              "gamma": np.logspace(-2, 2, 5)})
GradientBoostingRegressor(loss='quantile', alpha=0.95,
                                n_estimators=250, max_depth=3,
                                learning_rate=.1, min_samples_leaf=9,
                                min_samples_split=9)
DecisionTreeRegressor(max_depth=4)
AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                          n_estimators=300, random_state=rng)
RandomForestRegressor(n_estimators=10, min_samples_split=2, n_jobs=-1)

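The results fall into two distinct categories of failure:

1. The time field has no effect, probably because the oscillating behaviour of the target variable leaves it uncorrelated with time. However, secondary effects from the other variables give modest predictive power over future time ranges (these other variables have a simple correlation with the target variable).
2. When predict() is applied to the training time range, the predictions match the observations almost perfectly, but when given a future time range (data not used in training) the predicted value stays constant.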

Here is how I do the training and testing:

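import datetime
import numpy as np
import pandas as pd

# weather_df (daily observations) and model (any estimator listed above) are assumed to exist already.
weather_df.index = pd.to_datetime(weather_df.index, unit='D')
weather_df['Days'] = (weather_df.index - datetime.datetime(2005, 1, 1)).days
# .loc replaces the .ix indexer, which has been removed from pandas.
ts = pd.DataFrame({'Temperature': weather_df['Mean TemperatureC'].loc[:'2015-1-1'],
                   'Humidity': weather_df[' Mean Humidity'].loc[:'2015-1-1'],
                   'Visibility': weather_df[' Mean VisibilityKm'].loc[:'2015-1-1'],
                   'Wind': weather_df[' Mean Wind SpeedKm/h'].loc[:'2015-1-1'],
                   'Time': weather_df['Days'].loc[:'2015-1-1']
                   })
start_test = datetime.datetime(2012, 1, 1)
ts_train = ts[ts.index < start_test]
ts_test = ts  # predictions cover the whole range, training period included
# Stack the two predictors as columns, giving shape (n_samples, 2).
# np.array(a, b) would treat b as a dtype rather than as a second feature.
data_train = np.column_stack([ts_train.Humidity, ts_train.Time])
data_target = np.array(ts_train.Temperature)
model.fit(data_train, data_target)
data_test = np.column_stack([ts_test.Humidity, ts_test.Time])
pred = model.predict(data_test)
ts_test['Pred'] = pred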

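Is there a regression model that I could or should use for this problem, and if so, what would be appropriate options and parameters?

(Also, my handling of time objects in sklearn is far from elegant, so I am happy to take advice there.)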

Answer 1

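Here is my guess about what is happening with your two types of results:

.days does not convert your index into a form that repeats between your training and test samples. It becomes a unique value for every date in your dataset.

As a result, your models either ignore days (the first result) or overfit on the days feature (the second result), which makes them perform badly on your test data.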
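The second result can be reproduced on synthetic data. This is a minimal sketch, assuming a noise-free yearly sine wave and a single decision tree trained on the raw day count; the data and the split point are invented for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Assumed synthetic data: ten years of daily values with a one-year period.
days = np.arange(3653)
y = 10 + 10 * np.sin(2 * np.pi * days / 365.25)

# Train on roughly the first seven years, predict the remainder.
train = days < 2557
model = DecisionTreeRegressor(random_state=0)
model.fit(days[train].reshape(-1, 1), y[train])

train_r2 = model.score(days[train].reshape(-1, 1), y[train])
future_pred = model.predict(days[~train].reshape(-1, 1))

print("R^2 on training range:", train_r2)
print("distinct predictions for future days:", np.unique(future_pred).size)
```

Every future day count is larger than any split threshold learned during training, so all future samples land in the same leaf and receive the same prediction. Ensembles built from such trees behave the same way on this feature.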

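Suggestion:

If your dataset is large enough (it looks like it goes back to 2005), try using dayofyear or weekofyear instead, so that your model gets something generalizable out of the date information.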
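A minimal sketch of this suggestion, assuming synthetic daily temperatures with a yearly cycle (the data, the column names and the random forest settings are illustrative assumptions, not the asker's setup):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Assumed synthetic data: noisy yearly cycle over ten years.
idx = pd.date_range("2005-01-01", "2014-12-31", freq="D")
rng = np.random.default_rng(0)
doy = idx.dayofyear.to_numpy()
temp = 10 + 10 * np.sin(2 * np.pi * doy / 365.25) + rng.normal(0, 1, len(idx))

df = pd.DataFrame({"Temperature": temp}, index=idx)
df["Days"] = (df.index - pd.Timestamp("2005-01-01")).days  # unique for every date
df["DayOfYear"] = df.index.dayofyear                       # repeats every year

train = df[df.index < "2012-01-01"]
test = df[df.index >= "2012-01-01"]

def fit_and_score(feature):
    """Fit on one time feature and return R^2 on the future period."""
    model = RandomForestRegressor(n_estimators=50, min_samples_leaf=5,
                                  random_state=0)
    model.fit(train[[feature]], train["Temperature"])
    return model.score(test[[feature]], test["Temperature"])

r2_days = fit_and_score("Days")
r2_doy = fit_and_score("DayOfYear")
print(f"Days: {r2_days:.2f}, DayOfYear: {r2_doy:.2f}")
```

In recent pandas versions the week number is available through `index.isocalendar().week` rather than a `weekofyear` attribute.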

Answer 2

I agree with @zemekeneng that time should be taken modulo the corresponding period (24 hours, 12 months, etc.).
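One common way to do this, not spelled out in this answer, is to fold the time value by its period and then encode the phase as a sine/cosine pair, so that the end of one period sits next to the start of the next. A minimal sketch, where `cyclic_features` is a hypothetical helper name:

```python
import numpy as np

def cyclic_features(t, period):
    """Fold t by the period and return (sin, cos) of the resulting phase."""
    phase = 2 * np.pi * (np.asarray(t, dtype=float) % period) / period
    return np.column_stack([np.sin(phase), np.cos(phase)])

# Hours measured since some start time; 24 and 30 wrap around to 0 and 6.
hours = np.array([0, 6, 12, 23, 24, 30])
X = cyclic_features(hours, period=24)
print(X.round(3))
```

With this encoding, hour 23 ends up closer to hour 0 than hour 12 does, which a plain modulo value (23 versus 0) would not express.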

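Beyond that, I would like to remind you to use prior knowledge when choosing features or models. Since you already know that your data most likely follows sin(x), you should use that even in a data-driven approach.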

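We know that sin(x) can be approximated by x - x^3/3! + x^5/5! - x^7/7!, so these terms should be used as features. None of the models you used is likely to construct these features on its own. One approach is to create these higher-order features yourself and concatenate them to your other features. A linear model with regularization may then give you reasonable results.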
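A minimal sketch of this idea on synthetic data. Two things here are assumptions rather than part of the answer: the day count is first folded into a phase in [-pi, pi), because a series truncated at x^7 is only accurate near zero, and the `humidity` column is an invented second predictor. `odd_power_features` is a hypothetical helper name.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def odd_power_features(days, period=365.25):
    """Fold days into a phase x in [-pi, pi) and return x, x^3, x^5, x^7."""
    x = 2 * np.pi * (np.asarray(days, dtype=float) % period) / period
    x = np.where(x >= np.pi, x - 2 * np.pi, x)
    return np.column_stack([x, x**3, x**5, x**7])

# Assumed synthetic data: yearly cycle plus a linear effect of humidity.
rng = np.random.default_rng(0)
days = np.arange(3653)
humidity = rng.normal(0, 1, days.size)
y = (10 + 10 * np.sin(2 * np.pi * days / 365.25)
     - 2 * humidity + rng.normal(0, 0.5, days.size))

# Concatenate the hand-made time features with the other predictor.
X = np.column_stack([odd_power_features(days), humidity])

# Train on roughly the first seven years, score on the remaining years.
train = days < 2557
model = make_pipeline(StandardScaler(), Ridge(alpha=1e-3))
model.fit(X[train], y[train])
r2_future = model.score(X[~train], y[~train])
print(f"R^2 on future period: {r2_future:.3f}")
```

The features are standardized before the ridge fit because the raw powers differ by several orders of magnitude, which would otherwise make the penalty act very unevenly on them.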

python

regression

time-series

prediction

scikit-learn