从 auto.arima 生成的漂移模型(预测包)中获取一步预测时出错

Error in obtaining one-step forecasts from auto.arima generated drift model (forecast package)

我正在尝试从具有两个外部回归变量的 ARIMA 模型中提取一步预测,如 Hyndman 教授的博客 here 中所述。我先用auto.arima生成一个模型,然后把这个模型应用到全集。

该代码适用于我样本中的第一家公司。然而,第二家公司在提取一步预测时导致错误:

Error in `[<-.default`(`*tmp*`, , "drift", value = c(1.00000000000909,  :
subscript out of bounds

以下代码包含导致错误的时间序列:

df <-structure(list(fYearQtr = c(2004.5, 2004.75, 2005, 2005.25, 2005.5, 
2005.75, 2006, 2006.25, 2006.5, 2006.75, 2007, 2007.25, 2007.5, 
2007.75, 2008, 2008.25, 2008.5, 2008.75, 2009, 2009.25, 2009.5, 
2009.75, 2010, 2010.25, 2010.5, 2010.75, 2011, 2011.25, 2011.5, 
2011.75, 2012, 2012.25, 2012.5, 2012.75, 2013, 2013.25, 2013.5, 
2013.75, 2014, 2014.25), Sales = c(2014, 2350, 3490, 3243, 3520, 
3678, 5749, 4359, 4370, 4837, 7115, 5264, 5410, 6217, 9608, 7512, 
7464, 7895, 11880, 9084, 9734, 12207, 15683, 13499, 15700, 20343, 
26741, 24667, 28571, 28270, 46333, 39186, 35023, 35966, 54512, 
43603, 35323, 37472, 57594, 45646), last_SVI = c(17, 23, 25, 
20, 20, 28, 31, 22, 21, 30, 32, 22, 29, 34, 39, 26, 24, 34, 38, 
24, 28, 33, 34, 22, 38, 34, 38, 34, 34, 40, 52, 34, 34, 58, 54, 
31, 32, 53, 48, 30), SD_SVI = c(0.898717034272917, 1.66410058867569, 
2.35883500145783, 2.49615088301353, 1.48064435037847, 2.87562702192596, 
3.45854571482559, 2.26738299389972, 1.05003052458683, 3.67772226053586, 
3.19855736712181, 5.65685424949238, 2.66024868704471, 5.10153320342434, 
3.77236918007361, 2.79880927062444, 2.59437260831385, 3.0697030675746, 
4.66162731573098, 2.33973480855395, 3.43063124938119, 3.71069141390533, 
3.78255103173669, 9.43873633436932, 4.36918111203273, 3.44368615860597, 
4.85032380626706, 3.51188458428425, 2.16617351389673, 3.01066480434182, 
13.8264358990424, 5.36966789786234, 3.3166247903554, 14.2644438718921, 
7.43260316064229, 2.96777564982468, 4.21383557538856, 12.3594664228036, 
6.83880331412088, 2.01913919206257)), .Names = c("fYearQtr", 
"Sales", "xReg1", "xReg2"), row.names = c(NA, -40L), class = "data.frame")

示例数据:

head(df)
    fYearQtr Sales xReg1    xReg2
1  2004.50  2014    17 0.898717
2  2004.75  2350    23 1.664101
3  2005.00  3490    25 2.358835
4  2005.25  3243    20 2.496151
5  2005.50  3520    20 1.480644
6  2005.75  3678    28 2.875627

构建时间序列对象,train/test设置并提取一步预测:

require(forecast)
TS <- ts(df[,2:4], start = c(2004,3), end = c(2014,2), frequency=4)
TS.TRAIN <- window(TS, end=2011.4)
TS.TEST <- window(TS, start=2011.5)

# Build an arima model
MODEL <- auto.arima(TS.TRAIN[,'Sales'], xreg=TS.TRAIN[,colnames(TS.TRAIN) %in% c('xReg1', 'xReg2')])
FCAST <- forecast(MODEL, xreg=TS.TEST[,colnames(TS.TEST) %in% c('xReg1', 'xReg2')])
# Resulting model: ARIMA(0,1,0)(1,0,1)[4] with drift

现在提取 1 步预测:

refit <- Arima(TS[,'Sales'], model=MODEL, xreg=TS[,colnames(TS) %in% c('xReg1', 'xReg2')])
## Error in `[<-.default`(`*tmp*`, , "drift", value = c(1.00000000000909,  : 
# subscript out of bounds

令人困惑的部分:使用以下数据框(不同公司)时,完全相同的代码有效:

#########################################
# Other example: works just fine?
df_noissues <- structure(list(fQtrYear = c(2004.5, 2004.75, 2005, 2005.25, 2005.5, 
  2005.75, 2006, 2006.25, 2006.5, 2006.75, 2007, 2007.25, 2007.5, 
  2007.75, 2008, 2008.25, 2008.5, 2008.75, 2009, 2009.25, 2009.5, 
  2009.75, 2010, 2010.25, 2010.5, 2010.75, 2011, 2011.25, 2011.5, 
  2011.75, 2012, 2012.25, 2012.5, 2012.75, 2013, 2013.25, 2013.5, 
  2013.75, 2014, 2014.25), Sales = c(5818, 5979, 6221, 6410, 6401, 
 6536, 7111, 7797, 7631, 7840, 7908, 8066, 7387, 7387, 6998, 7245, 
 6970, 5688, 4147, 4244, 4615, 5433, 4887, 5187, 5287, 5652, 5958, 
 6585, 6419, 5989, 6006, 5963, 5833, 5898, 5833, 5849, 5765, 5585, 
 5454, 5836), mean_SVI = c(61.1666666666667, 47.9166666666667, 
 48.5833333333333, 51.4166666666667, 56, 51.8461538461538, 50.1666666666667, 
 60.75, 53.1538461538462, 48.9230769230769, 53, 53.6923076923077, 
 55.8461538461538, 46.3333333333333, 51.25, 54.1666666666667, 
 52.4166666666667, 50.4166666666667, 54.4166666666667, 49.3333333333333, 
 49.1666666666667, 39.5833333333333, 41.8333333333333, 43.9166666666667, 
 39.8333333333333, 37.1666666666667, 45.25, 45.9166666666667, 
 45.8333333333333, 39.7692307692308, 52.8461538461538, 60.6153846153846, 
 44.0769230769231, 37.75, 47.5, 45.1666666666667, 42.1666666666667, 
 39.25, 47.25, 47.4166666666667), SD_SVI = c(9.29157324317757, 
 11.0737883255365, 8.37157890324957, 6.08213977498251, 7.80442764775809, 
 9.09987320283598, 6.16195561244131, 11.2583302491977, 10.4390784542678, 
 8.38114489455884, 9.69535971483266, 11.4118696641159, 6.84161474417351, 
 8.96795642408249, 3.22278817739603, 6.23528570947538, 4.73782330790941, 
 9.3269729410149, 16.1777531496094, 10.9903538972992, 9.64679252708412, 
 11.1147595020261, 11.1586357371836, 7.22946412365063, 7.99810583636507, 
 6.89971453076579, 7.97866473221497, 3.89541299790439, 6.24984848301189, 
 7.5294294400245, 17.0775005677361, 12.6855459844296, 6.00640683578153, 
 6.77059148752228, 6.98700091728789, 6.97832140969228, 3.90415474109624, 
 4.39265916563698, 3.64629326103298, 5.08935311719625)), .Names = c("fQtrYear", 
"Sales", "xReg1", "xReg2"), row.names = c(NA, -40L), class = "data.frame")

示例数据:

head(df_noissues)
  fQtrYear Sales    xReg1     xReg2
1  2004.50  5818 61.16667  9.291573
2  2004.75  5979 47.91667 11.073788
3  2005.00  6221 48.58333  8.371579
4  2005.25  6410 51.41667  6.082140
5  2005.50  6401 56.00000  7.804428
6  2005.75  6536 51.84615  9.099873    

运行 构建test/training集合& ARIMA模型的相同代码:

TS <- ts(df_noissues[,2:4], start = c(2004,3), end = c(2014,2), frequency=4)
TS.TRAIN <- window(TS, end=2011.4)
TS.TEST <- window(TS, start=2011.5)

# Build an arima model
MODEL <- auto.arima(TS.TRAIN[,'Sales'], xreg=TS.TRAIN[,colnames(TS.TRAIN) %in% c('xReg1', 'xReg2')])
FCAST <- forecast(MODEL, xreg=TS.TEST[,colnames(TS.TEST) %in% c('xReg1', 'xReg2')])

提取 1 步预测:没有错误。

refit <- Arima(TS[,'Sales'], model=MODEL, xreg=TS[,colnames(TS) %in% c('xReg1', 'xReg2')])
## ARIMA(2,0,0)(0,1,0)[4]  

除了生成的模型存在差异(有/无漂移)之外,我似乎真的无法理解是什么导致了这种情况。 运行 auto.arima allowdrift=FALSE 确实似乎解决了这个问题。

问题是由于 xreg 参数的列名引起的。在第一种情况下,有一个漂移项将一列添加到 xreg,并触发列名称的更改。看模型就知道了。

> MODEL
Series: TS.TRAIN[, "Sales"] 
ARIMA(0,1,0)(1,0,1)[4] with drift         

Coefficients:
        sar1    sma1     drift  xreg.xReg1  xreg.xReg2
      0.8135  0.6554  1475.542     38.1461     84.5589
s.e.  0.1314  0.5816  1071.337     45.7297     74.2390

相比
> MODEL
Series: TS.TRAIN[, "Sales"] 
ARIMA(2,0,0)(0,1,0)[4]                    

Coefficients:
         ar1      ar2     xReg1    xReg2
      1.3394  -0.5685  -14.4072   4.3869
s.e.  0.1766   0.1783   16.3416  24.3968

我会把它添加到错误列表中,看看能不能找到解决方案。

作为变通方法,您可以按如下方式使用漂移重新拟合模型:

TS <- ts(df[,2:4], start = c(2004,3), end = c(2014,2), frequency=4)
TS.TRAIN <- window(TS, end=2011.4)
z <- TS.TRAIN[,colnames(TS.TRAIN) %in% c('xReg1', 'xReg2')]
MODEL <- auto.arima(TS.TRAIN[,'Sales'], xreg=z)
MODEL2 <- Arima(TS.TRAIN[,'Sales'], order=MODEL$arma[c(1,6,2)],
                seasonal=MODEL$arma[c(3,7,4)], xreg=cbind(1:nrow(z),z))
z <- TS[,colnames(TS) %in% c('xReg1', 'xReg2')]
refit <- Arima(TS[,'Sales'], model=MODEL2, xreg=cbind(1:nrow(z),z))