每日销售数据的时间序列预测

Time Series Forecasting of daily sales data

嗨,我是时间序列预测和 R 的新手。我从我兄弟的新创业公司收集每日销售数据来学习预测(从 28-03-19 到 7-7-21)。我想预测未来 30 天,有人可以帮我预测这些数据吗?我正在共享 link 这个 CSV 文件。我按照来自 kaggle 的人展示的步骤并尝试使用 ARIMA 进行预测,但与 MAPE 值为 28% 的测试数据集上的实际值相比,我得到的值完全不同。希望有人能指导我解决这个问题。

https://easyupload.io/hltskc

这是 2021 年的示例数据:

  data.frame(date=seq(as.Date("2021-01-01"),as.Date("2021-07-07"),by="1 day"),sales=c(100,105 ,167, 106 ,112, 107, 202,  98, 120, 109 ,114, 195, 110, 121,  89, 128, 104, 194 ,107 ,127, 117, 100, 117 ,205,116, 112, 119, 129, 161, 132, 110, 114, 118, 194, 114, 113, 113 ,172, 101 ,161 ,102, 135,  97, 122, 170, 126 ,160, 110, 118, 108,  111, 163, 110, 123 ,102 ,116, 181, 119, 155, 108, 122, 169, 115, 122, 116, 168 ,115 ,101, 117, 113 ,163, 115 ,107, 106, 171 ,109, 119, 107, 101 ,166, 105, 102 ,174,108, 102, 114,  97, 114, 149, 100, 111,  94,110 ,108, 100 , 92 ,104, 112, 160, 105,  98,  91,117 , 44,  60 , 36 , 50 , 51 , 54,  62,  61 , 62 , 50 , 59 , 85 , 49,  61 , 56 , 63,  39, 110 , 56 , 54 , 55,  56,  63 , 44, 115,  55,  50,  96 ,129 , 61,  59,  98 , 90 ,153,  90,  82 , 98,  79, 149,  97 , 85,  92,  78, 100 , 69, 152,  88,  76 , 91 ,145, 106,  69,  84,  72, 144 , 76, 74 , 94  ,70 , 86  ,76 ,137 , 71  ,87 , 91,  62 ,150 , 66 , 77  ,88, 135,  93 , 62 , 83, 83 , 72,  71 ,148 , 91 , 68 , 78 , 95 ,124,  69  ,78))

一个简单的出路是 facebook 发布的 prophet 包。这个预测很容易解决,但你必须进行质量评估(对预测进行评分)。

library(prophet) # possibly have to install it
set.seed(1234) # make it reproducable
# set the column names that prophet requires (ds has to be date type)
names(df) <- c("ds", "y")
# build the model in automatic mode (there are some parameter to tweak)
mod <- prophet::prophet(df)
# build the future df (days you want to predict
fdf <- prophet::make_future_dataframe(mod, periods = 92)
# make the prediction
forec <- predict(mod, fdf)
# plot the data and forecast
plot(mod, forec)

head(forec$yhat)
[1] 121.2439 137.5267 140.1426 132.0549 134.5965 136.5455

您可能希望根据转换需求(BoxCox、log、diff 等)研究分析时间序列的各个方面。查看您的数据(黑点)可以解释,由于 4 月底和 5 月的数据相对较低,因此存在外部回归变量(附加解释变量)。这个独立变量可能是 predicatable/know,因此对预测质量有重大影响。由于您使用的是每日数据,因此可能会有一些信息被隐藏到假期中,而这些信息在 Prophet 的建模中没有正确表示。一个关于数据长度的结束建议:根据你的预测范围,大约六个月的输入可能是不够的,特别是如果有一些年度模式需要考虑(经验法则是使用至少 2 年的数据,更好 3 ).

您可能对预测包及其功能感兴趣,但我认为它们在日间隔和短输入数据下的表现不会很好:

library(forecast)
# make time series object
tsdf <- ts(df$y, frequency = 365)
# make an auto.arima forecast model
mod2 <- forecast::auto.arima(log(tsdf))
# predict
plot(forecast::forecast(mod2, h = 92))

# make a "hip" prediction using neural net with 3 hidden neurons
mod3 <- forecast::nnetar(tsdf, P = 3, repeats = 1000)
plot(forecast::forecast(mod3, h = 92))

有些事情你可能会想 look/read/try:

使用的数据:

df <- data.frame(date=seq(as.Date("2021-01-01"),as.Date("2021-07-07"),by="1 day"),sales=c(100,105 ,167, 106 ,112, 107, 202,  98, 120, 109 ,114, 195, 110, 121,  89, 128, 104, 194 ,107 ,127, 117, 100, 117 ,205,116, 112, 119, 129, 161, 132, 110, 114, 118, 194, 114, 113, 113 ,172, 101 ,161 ,102, 135,  97, 122, 170, 126 ,160, 110, 118, 108,  111, 163, 110, 123 ,102 ,116, 181, 119, 155, 108, 122, 169, 115, 122, 116, 168 ,115 ,101, 117, 113 ,163, 115 ,107, 106, 171 ,109, 119, 107, 101 ,166, 105, 102 ,174,108, 102, 114,  97, 114, 149, 100, 111,  94,110 ,108, 100 , 92 ,104, 112, 160, 105,  98,  91,117 , 44,  60 , 36 , 50 , 51 , 54,  62,  61 , 62 , 50 , 59 , 85 , 49,  61 , 56 , 63,  39, 110 , 56 , 54 , 55,  56,  63 , 44, 115,  55,  50,  96 ,129 , 61,  59,  98 , 90 ,153,  90,  82 , 98,  79, 149,  97 , 85,  92,  78, 100 , 69, 152,  88,  76 , 91 ,145, 106,  69,  84,  72, 144 , 76, 74 , 94  ,70 , 86  ,76 ,137 , 71  ,87 , 91,  62 ,150 , 66 , 77  ,88, 135,  93 , 62 , 83, 83 , 72,  71 ,148 , 91 , 68 , 78 , 95 ,124,  69  ,78))

这是对多个模型的输出进行平均(取中值)并生成预测的一种相当基本的方法。请注意,none 的必要 EDA、模型调整或模型验证已在建模之前或之后进行。此外,我不确定这种平均预测区间的方法是否合理。有关最佳实践方法,请参阅 here

# Install pacakges if they are not already installed: necessary_packages => vector
necessary_packages <- c("forecast", "ggplot2")

# Create a vector containing the names of any packages needing installation:
# new_pacakges => vector
new_packages <- necessary_packages[!(necessary_packages %in%
                                       installed.packages()[, "Package"])]

# If the vector has more than 0 values, install the new pacakges
# (and their) associated dependencies:
if (length(new_packages) > 0) {
  install.packages(new_packages, dependencies = TRUE)
}

# Initialise the packages in the session: list of boolean => stdout (console)
lapply(necessary_packages, require, character.only = TRUE)

# Coerce to multiple seasonality time series object: 
y <- msts(
  df$sales,
  start = c(
    min(as.integer(strftime(df$date, "%Y"))), 
    1
  ),
  seasonal.periods = c(
    7.009615384615385, 
    30.5
  )
)

# Function to coalesce fit values with actuals: 
coalesceActualsFit <- function(fit_obj){
  
  # Coalesce actual values with fit values:
  # res => double vector
  res <- transform(
    setNames(
      data.frame(
        cbind(
          fit_obj$x, 
          fit_obj$fitted
        )[,1:2]
      ),
      c("x", "xhat")
    ),
    xhat = ifelse(is.na(xhat), x, xhat)
  )[,2,drop=TRUE]
  
  # Explicitly define returned object: 
  # double vector => Global Env()  
  return(res)
  
}

# Store the forecast period (n-days):
fcast_period <- 30

# Fit a holt winters model:
hw_fit <- HoltWinters(y)

# Remove NAs from the fitted values: 
hw_fit_vals <- coalesceActualsFit(hw_fit)

# Forecast out n-days:
hw_fcast <- forecast(hw_fit, h = fcast_period)

# Fit a neural network: 
nnetar_fit <- nnetar(y, P = 10, repeats = 50)

# Remove NAs from the fitted values: 
nnetar_fit_vals <- coalesceActualsFit(nnetar_fit)

# Forecast out n-days: 
nnetar_fcast <- forecast(nnetar_fit, h = fcast_period)

# Fit an stlf model: 
stlf_fit <- stlf(y)

# Forecast out n-days: 
stlf_fcast <- forecast(stlf_fit, h = fcast_period)

# Create an ensemble for the predictions: 
med_predict <- apply(
  cbind(hw_fit_vals, nnetar_fit_vals, stlf_fit$fitted),
  1, 
  median
)

# Create an ensemble for the forecasts: 
med_fcast <- apply(
  cbind(hw_fcast$mean, nnetar_fcast$mean, stlf_fcast$mean),
  1, 
  median
)

# Calculate the enesmble prediction intervals:
fcast_pis <- do.call(
  cbind,
  lapply(
    c("lower", "upper"),
    function(x){
      y <- data.frame(
        cbind(
          hw_fcast[[x]],
          nnetar_fcast[[x]],
          stlf_fcast[[x]]
          )
        )
      pi_80 <- apply(
        y[,endsWith(names(y), "80.")],
        1,
        median
      )
      pi_95 <- apply(
        y[,endsWith(names(y), "95.")],
        1,
        median
      )
      cbind(
        setNames(data.frame(pi_80), paste0(x, "_80")),
        setNames(data.frame(pi_95), paste0(x, "_95"))
      )
    }
  )
)

# Combine them into a single vector:
pt_fcasts <- c(
  med_predict, 
  med_fcast
)

# Create a data.frame containing the original data and 
# the point forecasts: 
fcast_df <- transform(
  data.frame(
    actuals = c(df$sales, rep(NA_real_, fcast_period)),
    pnt_fcasts = pt_fcasts,
    date = 
      as.Date(
        c(
          df$date,
          seq(
            max(df$date) + 1, 
            max(df$date) + fcast_period, 
            by = 1
          )
        ),
        "%Y-%m-%d"
      )
  ),
  pnt_fcasts = ifelse(is.na(pnt_fcasts), sales, pnt_fcasts)
)


# Adjust the forecast predition intervals to be the same
# length as the combined data: 
fcast_pis_adjusted <- transform(
  rbind(
    setNames(
      data.frame(
        do.call(
          cbind, 
          lapply(seq_len(ncol(fcast_pis)), function(i){
            pt_fcasts[1:nrow(df)]
          }
          )
        )
      ),
      names(fcast_pis)
    ),
    fcast_pis
  ),
  date = fcast_df$date
)

# Merge with the point forecast data: 
chart_data <- merge(
  fcast_df,
  fcast_pis_adjusted,
  by = "date"
)

# Chart the forecast vs actuals: 
ggplot(chart_data, aes(date)) + 
  geom_line(aes(y = actuals, colour = "Actuals")) + 
  geom_line(aes(y = lower_80, colour = "Lower 80 PI")) + 
  geom_line(aes(y = lower_95, colour = "Lower 95 PI")) +
  geom_line(aes(y = upper_80, colour = "Upper 80 PI")) +
  geom_line(aes(y = upper_95, colour = "Upper 95 PI")) +
  geom_line(aes(y = pnt_fcasts, colour = "Point Forecast")) +
  xlab("Date") +
  ylab("Sales") + 
  labs(colour = "Series")

数据:

df <- data.frame(
  date = seq(as.Date("2021-01-01"), as.Date("2021-07-07"), by = "1 day"),
  sales = c(
    100,
    105 ,
    167,
    106 ,
    112,
    107,
    202,
    98,
    120,
    109 ,
    114,
    195,
    110,
    121,
    89,
    128,
    104,
    194 ,
    107 ,
    127,
    117,
    100,
    117 ,
    205,
    116,
    112,
    119,
    129,
    161,
    132,
    110,
    114,
    118,
    194,
    114,
    113,
    113 ,
    172,
    101 ,
    161 ,
    102,
    135,
    97,
    122,
    170,
    126 ,
    160,
    110,
    118,
    108,
    111,
    163,
    110,
    123 ,
    102 ,
    116,
    181,
    119,
    155,
    108,
    122,
    169,
    115,
    122,
    116,
    168 ,
    115 ,
    101,
    117,
    113 ,
    163,
    115 ,
    107,
    106,
    171 ,
    109,
    119,
    107,
    101 ,
    166,
    105,
    102 ,
    174,
    108,
    102,
    114,
    97,
    114,
    149,
    100,
    111,
    94,
    110 ,
    108,
    100 ,
    92 ,
    104,
    112,
    160,
    105,
    98,
    91,
    117 ,
    44,
    60 ,
    36 ,
    50 ,
    51 ,
    54,
    62,
    61 ,
    62 ,
    50 ,
    59 ,
    85 ,
    49,
    61 ,
    56 ,
    63,
    39,
    110 ,
    56 ,
    54 ,
    55,
    56,
    63 ,
    44,
    115,
    55,
    50,
    96 ,
    129 ,
    61,
    59,
    98 ,
    90 ,
    153,
    90,
    82 ,
    98,
    79,
    149,
    97 ,
    85,
    92,
    78,
    100 ,
    69,
    152,
    88,
    76 ,
    91 ,
    145,
    106,
    69,
    84,
    72,
    144 ,
    76,
    74 ,
    94  ,
    70 ,
    86  ,
    76 ,
    137 ,
    71  ,
    87 ,
    91,
    62 ,
    150 ,
    66 ,
    77  ,
    88,
    135,
    93 ,
    62 ,
    83,
    83 ,
    72,
    71 ,
    148 ,
    91 ,
    68 ,
    78 ,
    95 ,
    124,
    69  ,
    78
  )
)