使用 dplyr 和 do 进行多步预测
Multi-steps forecasting with dplyr and do
dplyr 中的 do 函数可让您快速轻松地制作出许多很酷的模型,但我很难使用这些模型来获得良好的 rolling 预测。
# Data illustration
require(dplyr)
require(forecast)
df <- data.frame(
Date = seq.POSIXt(from = as.POSIXct("2015-01-01 00:00:00"),
to = as.POSIXct("2015-06-30 00:00:00"), by = "hour"))
df <- df %>% mutate(Hour = as.numeric(format(Date, "%H")) + 1,
Wind = runif(4320, min = 1, max = 5000),
Temp = runif(4320, min = - 20, max = 25),
Price = runif(4320, min = -15, max = 45)
)
我的因子变量是Hour
,我的外生变量是Wind
和temp
,我要预测的是Price
。所以,基本上,我有 24 个模型可以用来进行滚动预测。
现在,我的数据框包含 180 天。我想倒退 100 天,做 1 天的滚动预测,然后能够将其与实际 Price
.
进行比较
进行这种蛮力操作看起来像这样:
# First I fit the data frame to be exactly the right length
# 100 days to start with (2015-03-21 or so), then 99, then 98.., etc.
n <- 100 * 24
# Make the price <- NA so I can replace it with a forecast
df$Price[(nrow(df) - n): (nrow(df) - n + 24)] <- NA
# Now I make df just 81 days long, the estimation period + the first forecast
df <- df[1 : (nrow(df) - n + 24), ]
# The actual do & fit, later termed fx(df)
result <- df %>% group_by(Hour) %>% do ({
historical <- .[!is.na(.$Price), ]
forecasted <- .[is.na(.$Price), c("Date", "Hour", "Wind", "Temp")]
fit <- Arima(historical$Price, xreg = historical[, 3:4], order = c(1, 1, 0))
data.frame(forecasted[],
Price = forecast.Arima(fit, xreg = forecasted[3:4])$mean )
})
result
现在我会将 n
更改为 99 * 24。但是将其循环或应用会很棒,但我只是不知道该怎么做,并且还保存了每个新的预测。
我试过这样的循环,但还没有成功:
# 100 days ago, forecast that day, then the next, etc.
for (n in 1:100) {
nx <- n * 24 * 80 # Because I want to start after 80 days
df[nx:(nx + 23), 5] <- NA # Set prices to NA so I can forecast them
fx(df) # do the function
df.results[n] <- # Write the results into a vector / data frame to save them
# and now rinse and repeat for n + 1
}
对于类似 broom
的解决方案来说真是太棒了加分 :)
首先我会注意到您的 for 循环中存在错误。而不是 n*24*80
你可能意味着 (n+80)*24
。如果您还想包括第 81 天的预测,则循环中的计数器也应该从 0 到 99 而不是 1 到 100。
下面我将尝试为您的问题提供一个优雅的解决方案。首先,我们按照您在 post:
中所做的完全相同的方式定义我们的测试数据框
set.seed(2)
df <- data.frame(
Date = seq.POSIXt(from = as.POSIXct("2015-01-01 00:00:00"),
to = as.POSIXct("2015-06-30 00:00:00"), by = "hour"))
df <- df %>% mutate(Hour = as.numeric(format(Date, "%H")) + 1,
Wind = runif(4320, min = 1, max = 5000),
Temp = runif(4320, min = - 20, max = 25),
Price = runif(4320, min = -15, max = 45)
)
接下来,我们定义一个函数来执行特定日期的预测。输入参数是正在考虑的数据框和训练集中应该包含的最少训练天数(本例中为 80)。 minTrainingDays+offSet+1
表示我们预测的实际日期。请注意,我们从 0 开始计算偏移量。
forecastOneDay <- function(theData,minTrainingDays,offset)
{
nrTrainingRows <- (minTrainingDays+offset)*24
theForecast <- theData %>%
filter(min_rank(Date) <= nrTrainingRows+24) %>% # Drop future data that we don't need
group_by(Hour) %>%
do ({
trainingData <- head(.,-1) # For each group, drop the last entry from the dataframe
forecastData <- tail(.,1) %>% select(Date,Hour,Wind,Temp) # For each group, predict the last entry
fit <- Arima(trainingData$Price, xreg=trainingData[,3:4], order=c(1,1,0))
data.frame(forecastData, realPrice = tail(.,1)$Price, predictedPrice = forecast.Arima(fit,xreg=forecastData[3:4])$mean)
})
}
我们要预测第 81-180 天。换句话说,我们的训练集至少需要 80 天,并且要计算偏移量 0:99
的函数结果。这可以通过简单的 lapply
调用来完成。最后,我们将所有结果合并到一个数据框中:
# Perform one day forecasts for days 81-180
resultList <- lapply(0:99, function(x) forecastOneDay(df,80,x))
# Merge all the results
mergedForecasts <- do.call("rbind",resultList)
编辑
在查看了您的 post 和另一个 post 的答案后,我注意到我的答案有两个潜在的问题。首先,您需要 rolling window 80 天的训练数据。但是,在我之前的代码中,所有可用的训练数据都用于拟合模型,而不是仅返回 80 天。其次,代码对 DST 更改不健壮。
这两个问题已在下面的代码中修复。该函数的输入现在也更加直观:训练天数和实际预测天数可以用作输入参数。请注意,在对日期执行操作时,POSIXlt
数据格式可以正确处理 DST、闰年等内容。因为数据框中的日期是 POSIXct
类型,我们需要来回进行小型类型转换才能正确处理。
下面的新代码:
forecastOneDay <- function(theData,nrTrainingDays,predictDay) # predictDay should be greater than nrTrainingDays
{
initialDate <- as.POSIXlt(theData$Date[1]); # First day (midnight hour)
startDate <- initialDate # Beginning of training interval
endDate <- initialDate # End of test interval
startDate$mday <- initialDate$mday + (predictDay-nrTrainingDays-1) # Go back 80 days from predictday
endDate$mday <- startDate$mday + (nrTrainingDays+1) # +1 to include prediction day
theForecast <- theData %>%
filter(Date >= as.POSIXct(startDate),Date < as.POSIXct(endDate)) %>%
group_by(Hour) %>%
do ({
trainingData <- head(.,-1) # For each group, drop the last entry from the dataframe
forecastData <- tail(.,1) %>% select(Date,Hour,Wind,Temp) # For each group, predict the last entry
fit <- Arima(trainingData$Price, xreg=trainingData[,3:4], order=c(1,1,0))
data.frame(forecastData, realPrice = tail(.,1)$Price, predictedPrice = forecast.Arima(fit,xreg=forecastData[3:4])$mean)
})
}
# Perform one day forecasts for days 81-180
resultList <- lapply(81:180, function(x) forecastOneDay(df,80,x))
# Merge all the results
mergedForecasts <- do.call("rbind",resultList)
结果如下所示:
> head(mergedForecasts)
Source: local data frame [6 x 6]
Groups: Hour
Date Hour Wind Temp realPrice predictedPrice
1 2015-03-22 00:00:00 1 1691.589 -8.722152 -11.207139 5.918541
2 2015-03-22 01:00:00 2 1790.928 18.098358 3.902686 37.885532
3 2015-03-22 02:00:00 3 1457.195 10.166422 22.193270 34.984164
4 2015-03-22 03:00:00 4 1414.502 4.993783 6.370435 12.037642
5 2015-03-22 04:00:00 5 3020.755 9.540715 25.440357 -1.030102
6 2015-03-22 05:00:00 6 4102.651 2.446729 33.528199 39.607848
> tail(mergedForecasts)
Source: local data frame [6 x 6]
Groups: Hour
Date Hour Wind Temp realPrice predictedPrice
1 2015-06-29 18:00:00 19 1521.9609 13.6414797 12.884175 -6.7789109
2 2015-06-29 19:00:00 20 555.1534 3.4758159 37.958768 -5.1193514
3 2015-06-29 20:00:00 21 4337.6605 4.7242352 -9.244882 33.6817379
4 2015-06-29 21:00:00 22 3140.1531 0.8127839 15.825230 -0.4625457
5 2015-06-29 22:00:00 23 1389.0330 20.4667234 -14.802268 15.6755880
6 2015-06-29 23:00:00 24 763.0704 9.1646139 23.407525 3.8214642
可以使用 dplyr 创建一个 "rolling" data.frame,如下所示
library(dplyr)
library(lubridate)
WINDOW_SIZE_DAYS <- 80
df2 <- df %>%
mutate(Day = yday(Date)) %>%
replicate( n = WINDOW_SIZE_DAYS, simplify = FALSE ) %>%
bind_rows %>%
group_by(Date) %>%
mutate(Replica_Num = 1:n() ) %>%
mutate(Day_Group_id = Day + Replica_Num - 1 ) %>%
ungroup() %>%
group_by(Day_Group_id) %>%
filter( n() >= 24*WINDOW_SIZE_DAYS - 1 ) %>%
select( -Replica_Num ) %>%
arrange(Date) %>%
ungroup()
基本上,此代码会根据需要复制观察结果,并为每个 80 天的块分配相应的 Day_Group_id
。
这样做的目的是允许人们使用 group_by(Day_Group_id)
以便 运行 在每个 80 天的块上分别使用模型。
以后大家可以随意使用了。例如,只是 copy/pasting 上面的 arima 代码如下:
df3 <- df2 %>%
group_by(Day_Group_id, Hour) %>%
arrange(Date) %>%
do ({
trainingData <- head(.,-1) # For each group, drop the last entry from the dataframe
forecastData <- tail(.,1) %>% select(Date,Hour,Wind,Temp) # For each group, predict the last entry
fit <- Arima(trainingData$Price, xreg=trainingData[,3:4], order=c(1,1,0))
data.frame(forecastData, realPrice = tail(.,1)$Price, predictedPrice = forecast.Arima(fit,xreg=forecastData[3:4])$mean)
})
请注意:
这里使用filter(n() >= 24*WINDOW_SIZE_DAYS - 1)
代替filter(n() == 24*WINDOW_SIZE_DAYS)
,是为了select完整的80天windows。这是由于 2015-03-08
的夏令时调整所致。数据集中不存在小时 2015-03-08 02:00:00
,因为它从 2015-03-08 01:00:00
直接跳到 2015-03-08 03:00:00
。
dplyr 中的 do 函数可让您快速轻松地制作出许多很酷的模型,但我很难使用这些模型来获得良好的 rolling 预测。
# Data illustration
require(dplyr)
require(forecast)
df <- data.frame(
Date = seq.POSIXt(from = as.POSIXct("2015-01-01 00:00:00"),
to = as.POSIXct("2015-06-30 00:00:00"), by = "hour"))
df <- df %>% mutate(Hour = as.numeric(format(Date, "%H")) + 1,
Wind = runif(4320, min = 1, max = 5000),
Temp = runif(4320, min = - 20, max = 25),
Price = runif(4320, min = -15, max = 45)
)
我的因子变量是Hour
,我的外生变量是Wind
和temp
,我要预测的是Price
。所以,基本上,我有 24 个模型可以用来进行滚动预测。
现在,我的数据框包含 180 天。我想倒退 100 天,做 1 天的滚动预测,然后能够将其与实际 Price
.
进行这种蛮力操作看起来像这样:
# First I fit the data frame to be exactly the right length
# 100 days to start with (2015-03-21 or so), then 99, then 98.., etc.
n <- 100 * 24
# Make the price <- NA so I can replace it with a forecast
df$Price[(nrow(df) - n): (nrow(df) - n + 24)] <- NA
# Now I make df just 81 days long, the estimation period + the first forecast
df <- df[1 : (nrow(df) - n + 24), ]
# The actual do & fit, later termed fx(df)
result <- df %>% group_by(Hour) %>% do ({
historical <- .[!is.na(.$Price), ]
forecasted <- .[is.na(.$Price), c("Date", "Hour", "Wind", "Temp")]
fit <- Arima(historical$Price, xreg = historical[, 3:4], order = c(1, 1, 0))
data.frame(forecasted[],
Price = forecast.Arima(fit, xreg = forecasted[3:4])$mean )
})
result
现在我会将 n
更改为 99 * 24。但是将其循环或应用会很棒,但我只是不知道该怎么做,并且还保存了每个新的预测。
我试过这样的循环,但还没有成功:
# 100 days ago, forecast that day, then the next, etc.
for (n in 1:100) {
nx <- n * 24 * 80 # Because I want to start after 80 days
df[nx:(nx + 23), 5] <- NA # Set prices to NA so I can forecast them
fx(df) # do the function
df.results[n] <- # Write the results into a vector / data frame to save them
# and now rinse and repeat for n + 1
}
对于类似 broom
的解决方案来说真是太棒了加分 :)
首先我会注意到您的 for 循环中存在错误。而不是 n*24*80
你可能意味着 (n+80)*24
。如果您还想包括第 81 天的预测,则循环中的计数器也应该从 0 到 99 而不是 1 到 100。
下面我将尝试为您的问题提供一个优雅的解决方案。首先,我们按照您在 post:
中所做的完全相同的方式定义我们的测试数据框set.seed(2)
df <- data.frame(
Date = seq.POSIXt(from = as.POSIXct("2015-01-01 00:00:00"),
to = as.POSIXct("2015-06-30 00:00:00"), by = "hour"))
df <- df %>% mutate(Hour = as.numeric(format(Date, "%H")) + 1,
Wind = runif(4320, min = 1, max = 5000),
Temp = runif(4320, min = - 20, max = 25),
Price = runif(4320, min = -15, max = 45)
)
接下来,我们定义一个函数来执行特定日期的预测。输入参数是正在考虑的数据框和训练集中应该包含的最少训练天数(本例中为 80)。 minTrainingDays+offSet+1
表示我们预测的实际日期。请注意,我们从 0 开始计算偏移量。
forecastOneDay <- function(theData,minTrainingDays,offset)
{
nrTrainingRows <- (minTrainingDays+offset)*24
theForecast <- theData %>%
filter(min_rank(Date) <= nrTrainingRows+24) %>% # Drop future data that we don't need
group_by(Hour) %>%
do ({
trainingData <- head(.,-1) # For each group, drop the last entry from the dataframe
forecastData <- tail(.,1) %>% select(Date,Hour,Wind,Temp) # For each group, predict the last entry
fit <- Arima(trainingData$Price, xreg=trainingData[,3:4], order=c(1,1,0))
data.frame(forecastData, realPrice = tail(.,1)$Price, predictedPrice = forecast.Arima(fit,xreg=forecastData[3:4])$mean)
})
}
我们要预测第 81-180 天。换句话说,我们的训练集至少需要 80 天,并且要计算偏移量 0:99
的函数结果。这可以通过简单的 lapply
调用来完成。最后,我们将所有结果合并到一个数据框中:
# Perform one day forecasts for days 81-180
resultList <- lapply(0:99, function(x) forecastOneDay(df,80,x))
# Merge all the results
mergedForecasts <- do.call("rbind",resultList)
编辑 在查看了您的 post 和另一个 post 的答案后,我注意到我的答案有两个潜在的问题。首先,您需要 rolling window 80 天的训练数据。但是,在我之前的代码中,所有可用的训练数据都用于拟合模型,而不是仅返回 80 天。其次,代码对 DST 更改不健壮。
这两个问题已在下面的代码中修复。该函数的输入现在也更加直观:训练天数和实际预测天数可以用作输入参数。请注意,在对日期执行操作时,POSIXlt
数据格式可以正确处理 DST、闰年等内容。因为数据框中的日期是 POSIXct
类型,我们需要来回进行小型类型转换才能正确处理。
下面的新代码:
forecastOneDay <- function(theData,nrTrainingDays,predictDay) # predictDay should be greater than nrTrainingDays
{
initialDate <- as.POSIXlt(theData$Date[1]); # First day (midnight hour)
startDate <- initialDate # Beginning of training interval
endDate <- initialDate # End of test interval
startDate$mday <- initialDate$mday + (predictDay-nrTrainingDays-1) # Go back 80 days from predictday
endDate$mday <- startDate$mday + (nrTrainingDays+1) # +1 to include prediction day
theForecast <- theData %>%
filter(Date >= as.POSIXct(startDate),Date < as.POSIXct(endDate)) %>%
group_by(Hour) %>%
do ({
trainingData <- head(.,-1) # For each group, drop the last entry from the dataframe
forecastData <- tail(.,1) %>% select(Date,Hour,Wind,Temp) # For each group, predict the last entry
fit <- Arima(trainingData$Price, xreg=trainingData[,3:4], order=c(1,1,0))
data.frame(forecastData, realPrice = tail(.,1)$Price, predictedPrice = forecast.Arima(fit,xreg=forecastData[3:4])$mean)
})
}
# Perform one day forecasts for days 81-180
resultList <- lapply(81:180, function(x) forecastOneDay(df,80,x))
# Merge all the results
mergedForecasts <- do.call("rbind",resultList)
结果如下所示:
> head(mergedForecasts)
Source: local data frame [6 x 6]
Groups: Hour
Date Hour Wind Temp realPrice predictedPrice
1 2015-03-22 00:00:00 1 1691.589 -8.722152 -11.207139 5.918541
2 2015-03-22 01:00:00 2 1790.928 18.098358 3.902686 37.885532
3 2015-03-22 02:00:00 3 1457.195 10.166422 22.193270 34.984164
4 2015-03-22 03:00:00 4 1414.502 4.993783 6.370435 12.037642
5 2015-03-22 04:00:00 5 3020.755 9.540715 25.440357 -1.030102
6 2015-03-22 05:00:00 6 4102.651 2.446729 33.528199 39.607848
> tail(mergedForecasts)
Source: local data frame [6 x 6]
Groups: Hour
Date Hour Wind Temp realPrice predictedPrice
1 2015-06-29 18:00:00 19 1521.9609 13.6414797 12.884175 -6.7789109
2 2015-06-29 19:00:00 20 555.1534 3.4758159 37.958768 -5.1193514
3 2015-06-29 20:00:00 21 4337.6605 4.7242352 -9.244882 33.6817379
4 2015-06-29 21:00:00 22 3140.1531 0.8127839 15.825230 -0.4625457
5 2015-06-29 22:00:00 23 1389.0330 20.4667234 -14.802268 15.6755880
6 2015-06-29 23:00:00 24 763.0704 9.1646139 23.407525 3.8214642
可以使用 dplyr 创建一个 "rolling" data.frame,如下所示
library(dplyr)
library(lubridate)
WINDOW_SIZE_DAYS <- 80
df2 <- df %>%
mutate(Day = yday(Date)) %>%
replicate( n = WINDOW_SIZE_DAYS, simplify = FALSE ) %>%
bind_rows %>%
group_by(Date) %>%
mutate(Replica_Num = 1:n() ) %>%
mutate(Day_Group_id = Day + Replica_Num - 1 ) %>%
ungroup() %>%
group_by(Day_Group_id) %>%
filter( n() >= 24*WINDOW_SIZE_DAYS - 1 ) %>%
select( -Replica_Num ) %>%
arrange(Date) %>%
ungroup()
基本上,此代码会根据需要复制观察结果,并为每个 80 天的块分配相应的 Day_Group_id
。
这样做的目的是允许人们使用 group_by(Day_Group_id)
以便 运行 在每个 80 天的块上分别使用模型。
以后大家可以随意使用了。例如,只是 copy/pasting 上面的 arima 代码如下:
df3 <- df2 %>%
group_by(Day_Group_id, Hour) %>%
arrange(Date) %>%
do ({
trainingData <- head(.,-1) # For each group, drop the last entry from the dataframe
forecastData <- tail(.,1) %>% select(Date,Hour,Wind,Temp) # For each group, predict the last entry
fit <- Arima(trainingData$Price, xreg=trainingData[,3:4], order=c(1,1,0))
data.frame(forecastData, realPrice = tail(.,1)$Price, predictedPrice = forecast.Arima(fit,xreg=forecastData[3:4])$mean)
})
请注意:
这里使用filter(n() >= 24*WINDOW_SIZE_DAYS - 1)
代替filter(n() == 24*WINDOW_SIZE_DAYS)
,是为了select完整的80天windows。这是由于 2015-03-08
的夏令时调整所致。数据集中不存在小时 2015-03-08 02:00:00
,因为它从 2015-03-08 01:00:00
直接跳到 2015-03-08 03:00:00
。