ARMA.predict for out-of-sample forecast does not work with floating points?
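After developing a small ARMAX forecasting model for in-sample analysis, I would like to predict some data out of sample.
The time series I use for the forecast calculation starts at 2013-01-01 and ends at 2013-12-31.
Here is the data I am working with: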
import numpy as np
import pandas as pd

hr = np.loadtxt("Data_2013_17.txt")
index = pd.date_range(start='2013-1-1', end='2013-12-31', freq='D')
df = pd.DataFrame(hr, index=index)
holidays = ['2013-1-1', '2013-3-29', '2013-4-1', '2013-5-1', '2013-5-9', '2013-5-20', '2013-10-3', '2013-12-25', '2013-12-26']
# holidays for all Bundesländer
# drop the holidays from the business-day index (set difference)
idx = df.asfreq('B').index.difference(pd.DatetimeIndex(holidays))
indexed_df = df.reindex(idx)
# indexed_df = df.asfreq('B') (includes holidays)
# 'D'=day
# 'B'=business day
# W@MON=shows only mondays
# external variable
hr_ = np.loadtxt("Data_2_2013.txt")
index = pd.date_range(start='2013-1-1', end='2013-12-31', freq='D')
df = pd.DataFrame(hr_, index=index)
idx2 = df.asfreq('B').index.difference(pd.DatetimeIndex(holidays))
external_df1 = df.reindex(idx2)
# fill missing values of the external variable with its mean
external_df = external_df1.fillna(external_df1.mean())
Output:
0
2013-01-02 49.56
2013-01-03 48.09
2013-01-04 36.79
2013-01-07 60.84
2013-01-08 59.72
2013-01-09 61.88
2013-01-10 57.95
2013-01-11 56.29
2013-01-14 57.89
2013-01-15 64.49
2013-01-16 58.92
2013-01-17 62.30
2013-01-18 55.92
2013-01-21 55.67
2013-01-22 60.73
2013-01-23 60.12
2013-01-24 65.70
2013-01-25 55.15
2013-01-28 51.79
2013-01-29 39.69
2013-01-30 37.90
2013-01-31 37.60
2013-02-01 41.26
2013-02-04 29.18
2013-02-05 39.55
2013-02-06 47.57
2013-02-07 51.97
2013-02-08 46.95
2013-02-11 42.79
2013-02-12 51.83
... ...
2013-11-18 58.04
2013-11-19 62.96
2013-11-20 63.90
2013-11-21 64.09
2013-11-22 64.78
2013-11-25 59.59
2013-11-26 70.69
2013-11-27 61.57
2013-11-28 47.87
2013-11-29 34.61
2013-12-02 68.77
2013-12-03 77.84
2013-12-04 63.09
2013-12-05 40.94
2013-12-06 38.60
2013-12-09 65.79
2013-12-10 68.98
2013-12-11 77.86
2013-12-12 76.44
2013-12-13 85.90
2013-12-16 53.51
2013-12-17 73.67
2013-12-18 59.76
2013-12-19 53.11
2013-12-20 38.33
2013-12-23 36.93
2013-12-24 11.30
2013-12-27 30.32
2013-12-30 39.94
2013-12-31 31.27
[252 rows x 1 columns]
0
2013-01-02 70770
2013-01-03 74155
2013-01-04 74286
2013-01-07 75360
2013-01-08 76910
2013-01-09 78561
2013-01-10 77427
2013-01-11 75260
2013-01-14 78738
2013-01-15 78286
2013-01-16 79568
2013-01-17 79761
2013-01-18 77518
2013-01-21 80089
2013-01-22 79915
2013-01-23 78607
2013-01-24 79761
2013-01-25 77908
2013-01-28 79873
2013-01-29 80535
2013-01-30 76340
2013-01-31 78244
2013-02-01 77749
2013-02-04 79125
2013-02-05 79001
2013-02-06 77837
2013-02-07 77495
2013-02-08 75372
2013-02-11 73856
2013-02-12 77494
... ...
2013-11-18 76292
2013-11-19 77420
2013-11-20 74993
2013-11-21 76658
2013-11-22 74769
2013-11-25 78347
2013-11-26 77756
2013-11-27 79648
2013-11-28 80075
2013-11-29 78587
2013-12-02 76867
2013-12-03 76070
2013-12-04 80344
2013-12-05 81736
2013-12-06 79617
2013-12-09 78085
2013-12-10 78430
2013-12-11 78120
2013-12-12 77735
2013-12-13 75872
2013-12-16 78651
2013-12-17 76180
2013-12-18 75867
2013-12-19 76018
2013-12-20 71101
2013-12-23 66841
2013-12-24 64557
2013-12-27 66747
2013-12-30 64787
2013-12-31 61101
[252 rows x 1 columns]
Descriptive statistics of ts:
0
count 252.000000
mean 44.583651
std 11.708938
min 11.300000
25% 34.597500
50% 44.200000
75% 51.947500
max 85.900000
Skewness of endog_var: [ 0.44315988]
Kurtsosis of endog_var: [ 3.18049689]
Correlation hr & hr_: (0.71074420030220553, 2.0635001219278823e-57)
Augmented Dickey-Fuller Test for endog_var: (-2.9282259926181839, 0.042162780619902182, {'5%': -2.8698573654386559, '1%': -3.4492269328800189, '10%': -2.5712010851306641}, <statsmodels.tsa.stattools.ResultsStore object at 0x111e2ca50>)
Choice of p and q values:
In:
arma_mod = sm.tsa.ARMA(indexed_df, (3,3), external_df).fit()
z = arma_mod.params
print('P- and Q-Values:')
print(z)
Out:
P- and Q-Values:
const 19.674538
0 0.000345
ar.L1.0 -0.062796
ar.L2.0 0.340800
ar.L3.0 0.436345
ma.L1.0 0.613498
ma.L2.0 0.057267
ma.L3.0 -0.415455
dtype: float64
/Applications/anaconda/lib/python2.7/site-packages/statsmodels-0.6.1-py2.7-macosx-10.5-x86_64.egg/statsmodels/base/model.py:466: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
"Check mle_retvals", ConvergenceWarning)
This is how I try to predict out of sample:
In:
start_pred = '2014-1-3'
end_pred = '2014-1-3'
predict_price1 = arma_mod1.predict(start_pred, end_pred, external_df)#, dynamic=True)
print ('Predicted Price (ARMAX): {}' .format(predict_price1))
Out:
Traceback (most recent call last):
File "<ipython-input-34-ad7feec95e4a>", line 6, in <module>
predict_price1 = arma_mod1.predict(start_pred, end_pred, external_df)#, dynamic=True)
File "/Applications/anaconda/lib/python2.7/site-packages/statsmodels-0.6.1-py2.7-macosx-10.5-x86_64.egg/statsmodels/base/wrapper.py", line 92, in wrapper
return data.wrap_output(func(results, *args, **kwargs), how)
File "/Applications/anaconda/lib/python2.7/site-packages/statsmodels-0.6.1-py2.7-macosx-10.5-x86_64.egg/statsmodels/tsa/arima_model.py", line 1441, in predict
return self.model.predict(self.params, start, end, exog, dynamic)
File "/Applications/anaconda/lib/python2.7/site-packages/statsmodels-0.6.1-py2.7-macosx-10.5-x86_64.egg/statsmodels/tsa/arima_model.py", line 711, in predict
start = self._get_predict_start(start, dynamic)
File "/Applications/anaconda/lib/python2.7/site-packages/statsmodels-0.6.1-py2.7-macosx-10.5-x86_64.egg/statsmodels/tsa/arima_model.py", line 646, in _get_predict_start
method)
File "/Applications/anaconda/lib/python2.7/site-packages/statsmodels-0.6.1-py2.7-macosx-10.5-x86_64.egg/statsmodels/tsa/arima_model.py", line 376, in _validate
start = _index_date(start, dates)
File "/Applications/anaconda/lib/python2.7/site-packages/statsmodels-0.6.1-py2.7-macosx-10.5-x86_64.egg/statsmodels/tsa/base/datetools.py", line 57, in _index_date
"an integer" % date)
ValueError: There is no frequency for these dates and date 2014-01-03 00:00:00 is not in dates index. Try giving a date that is in the dates index or use an integer
I don't understand this error!
The ARIMA source code, namely 'datetools.py', tells me the following:
    except KeyError as err:
        freq = _infer_freq(dates)
        if freq is None:
            #TODO: try to intelligently roll forward onto a date in the
            # index. Waiting to drop pandas 0.7.x support so this is
            # cleaner to do.
            raise ValueError("There is no frequency for these dates and "
                             "date %s is not in dates index. Try giving a "
                             "date that is in the dates index or use "
                             "an integer" % date)
        # we can start prediction at the end of endog
        if _idx_from_dates(dates[-1], date, freq) == 1:
            return len(dates)
        raise ValueError("date %s not in date index. Try giving a "
                         "date that is in the dates index or use an integer"
                         % date)
def _date_from_idx(d1, idx, freq):
    """
    Returns the date from an index beyond the end of a date series.
    d1 is the datetime of the last date in the series. idx is the
    index distance of how far the next date should be from d1. Ie., 1 gives
    the next date from d1 at freq.
    Notes
    -----
    This does not do any rounding to make sure that d1 is actually on the
    offset. For now, this needs to be taken care of before you get here.
    """
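So this means it should be possible to forecast out of sample. I just don't understand where and how I need to change my objects.
I found some older posts, but none of them tells me what to do: Python out of sample forecasting ARIMA predict()
and https://stats.stackexchange.com/questions/76160/im-not-sure-that-statsmodels-is-predicting-out-of-sample
How can I predict out-of-sample data given the information above?
Any help is much appreciated.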
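Two problems. As the error message indicates, '2014-1-3' is not in your data. As the documentation should mention, the prediction has to start within one time step of the end of your data.
Second problem: your data has no defined frequency. Removing the holidays from the business-day data leaves the index without a frequency, so there is currently no way to know what the next day should be. You could write a custom date offset for pandas, but that takes some work.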
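As a rough illustration of the custom-offset idea, here is a minimal sketch using pandas' `CustomBusinessDay`. The 2013 holidays are the ones from the question, written as zero-padded ISO strings; the extra `'2014-01-01'` entry is an assumption added only so the calendar extends past the sample. The sketch only shows that the irregular index has no inferable frequency while the custom offset knows the "next day"; whether a given statsmodels version accepts such an offset as the index frequency is not checked here.

```python
import pandas as pd
from pandas.tseries.offsets import CustomBusinessDay

# Holidays from the question, plus one 2014 date (an assumption for illustration).
holidays = ['2013-01-01', '2013-03-29', '2013-04-01', '2013-05-01',
            '2013-05-09', '2013-05-20', '2013-10-03', '2013-12-25',
            '2013-12-26', '2014-01-01']

# Index built the way the question does it: business days minus holidays.
irregular = pd.date_range('2013-01-01', '2013-12-31', freq='B').difference(
    pd.DatetimeIndex(holidays))
print(len(irregular), pd.infer_freq(irregular))  # no frequency can be inferred

# The same dates generated from a custom business-day offset.
bday = CustomBusinessDay(holidays=holidays)
regular = pd.date_range('2013-01-01', '2013-12-31', freq=bday)

# With an offset, the dates after the end of the sample are well defined.
future = pd.date_range(regular[-1] + bday, periods=5, freq=bday)
steps_ahead = future.get_loc(pd.Timestamp('2014-01-03')) + 1
print(future[0].date(), steps_ahead)
```

Under these assumptions, '2014-01-03' is not the first step after the sample, which matches the first point of the answer: a one-step forecast has to start at the observation directly after 2013-12-31.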
The easiest workaround is to just use numpy arrays and drop the pandas DatetimeIndex.
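Here is a solution I came across on blackarbs for out-of-sample forecasting of a time series indexed by a pandas DatetimeIndex.
They run arma.forecast() for an integer number of steps and splice the output into a DataFrame.
The pd.date_range call turns the integer steps into dates beyond the original data sample.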
# ts = your data (indexed by a DatetimeIndex), mdl = the fitted ARMA results
n_steps = 12
# label the forecasts starting one day after the last observation
idx = pd.date_range(ts.index[-1] + pd.Timedelta(days=1), periods=n_steps, freq='D')
f, err95, ci95 = mdl.forecast(steps=n_steps) # 95% CI
_, err99, ci99 = mdl.forecast(steps=n_steps, alpha=0.01) # 99% CI
fc_95 = pd.DataFrame(np.column_stack([f, ci95]),
                     index=idx, columns=['forecast', 'lower_ci_95', 'upper_ci_95'])
fc_99 = pd.DataFrame(np.column_stack([ci99]),
                     index=idx, columns=['lower_ci_99', 'upper_ci_99'])
fc_all = fc_95.combine_first(fc_99)
fc_all.head()
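For business-day data like the series in the question, the same labelling step could use freq='B' instead of 'D'. The following is only a sketch of the date handling: `f` and `ci95` are placeholder arrays standing in for the output of a forecast call, not real model results, and no holiday calendar is applied.

```python
import numpy as np
import pandas as pd

last_obs = pd.Timestamp('2013-12-31')  # last in-sample date from the question
n_steps = 5

# Placeholder values standing in for forecast output (not real results).
f = np.full(n_steps, 40.0)
ci95 = np.column_stack([f - 10.0, f + 10.0])

# Start one business day after the last observation and skip weekends.
idx = pd.date_range(last_obs + pd.offsets.BDay(1), periods=n_steps, freq='B')
fc = pd.DataFrame(np.column_stack([f, ci95]), index=idx,
                  columns=['forecast', 'lower_ci_95', 'upper_ci_95'])
print(fc)
```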