Unnesting deep lists after applying the rolling_origin function from the rsample package

I have some data that looks like this:

Head:

  dfID  date       group groupValues
  <chr> <date>     <chr>       <dbl>
1 df1   2020-03-01 grp1       0.175 
2 df1   2020-03-01 grp2       0.150 
3 df1   2020-03-01 grp3       0.0509

Tail:

  dfID  date       group groupValues
  <chr> <date>     <chr>       <dbl>
1 df3   2020-06-29 grp7        0.705
2 df3   2020-06-29 grp8        0.473
3 df3   2020-06-29 grp9        0.900

This is time series data with 3 unique dfIDs and 9 unique groups. Filtering the date column to a single day gives 3 dfs and 9 groups:

df %>% 
  filter(date == "2020-03-01")

  dfID  date       group groupValues
  <chr> <date>     <chr>       <dbl>
1 df1   2020-03-01 grp1       0.175 
2 df1   2020-03-01 grp2       0.150 
3 df1   2020-03-01 grp3       0.0509
4 df2   2020-03-01 grp4       0.133 
5 df2   2020-03-01 grp5       0.779 
6 df2   2020-03-01 grp6       0.506 
7 df3   2020-03-01 grp7       0.868 
8 df3   2020-03-01 grp8       0.552 
9 df3   2020-03-01 grp9       0.274

Next I want to split the data into pairwise combinations of dfID.

combinedSplit <- combn(levels(as.factor(df$dfID)), m = 2, FUN = function(x)
  df %>% 
    filter(dfID %in% x), simplify = FALSE)

names(combinedSplit) <- combn(levels(as.factor(df$dfID)), m = 2, str_c, collapse="_")

So now I have a list of 3 data frames, one for each pairwise combination of dfID:

df1_df2
df1_df3
df2_df3
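As a quick check of how those three names arise, here is a minimal base R sketch. The vector `ids` is a stand-in for `levels(as.factor(df$dfID))`, and `paste` is used in place of `str_c` only so that no package is needed.

```r
# Stand-in for levels(as.factor(df$dfID)); hypothetical, for illustration only
ids <- c("df1", "df2", "df3")

# combn() passes extra arguments (here collapse) on to FUN,
# so each pair is pasted into a single name
pairNames <- combn(ids, m = 2, FUN = paste, collapse = "_")

# Without FUN and with simplify = FALSE, combn() returns the pairs themselves
pairMembers <- combn(ids, m = 2, simplify = FALSE)

print(pairNames)
```

Each element of `pairMembers` is the pair of dfIDs that the corresponding element of `combinedSplit` is filtered on.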

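Now I want to apply the rolling_origin function from the rsample package to split the time series into training and test data. I can naively apply the function by mapping over the list.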

rolledData <-  combinedSplit %>% 
  map(., ~ rolling_origin(
    data = .,
    initial = 60,
    assess = 1,
    cumulative = FALSE,
    skip = 0
  )
)

I can access the tail of the first split for the combination df1_df2.

map(rolledData$df1_df2$splits, ~analysis(.x))[[1]] %>% tail()

  dfID  date       group groupValues
  <chr> <date>     <chr>       <dbl>
1 df1   2020-03-19 grp1        0.528
2 df1   2020-03-19 grp2        0.394
3 df1   2020-03-19 grp3        0.532
4 df1   2020-03-20 grp1        0.586
5 df1   2020-03-20 grp2        0.369
6 df1   2020-03-20 grp3        0.153

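This is incorrect. In the rolling_origin function I specified an initial training window of 60 periods (days), but this data ends on 20 March. That is because rolling_origin counts rows, not dates: it takes the first 60 observations (3 groups * 20 days) rather than the first 60 days of the time series.

So I want to apply the rolling_origin function to each grp, so that each grp gets its own 60-day rolling window.

Here I think it is best to nest() by group first and then apply the rolling_origin function, since each group is then separate and the 60 days apply to grp1, grp2 and grp3 individually. Afterwards, unnest() to put grp1, grp2 and grp3 back into the same data frame.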
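The row-counting behaviour can be reproduced on a small scale. The sketch below uses a made-up tibble `toy` (3 groups over 10 days) and a smaller window (`initial = 6`); these values are assumptions chosen only to keep the example short.

```r
library(tibble)
library(rsample)

# Hypothetical long-format data: 10 days x 3 groups = 30 rows, ordered by date
toy <- tibble(
  date = rep(seq.Date(as.Date("2020-03-01"), by = "day", length.out = 10), each = 3),
  group = rep(c("grp1", "grp2", "grp3"), times = 10),
  groupValues = seq_len(30)
)

# initial = 6 means 6 rows, which is only 2 days once 3 groups share each date
rolled <- rolling_origin(toy, initial = 6, assess = 1, cumulative = FALSE)
firstAnalysis <- analysis(rolled$splits[[1]])

print(range(firstAnalysis$date))
```

The first analysis set holds 6 rows but spans only 2 distinct dates, which mirrors the 60 rows covering 20 days in the question.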

nestedRolledData <- map(combinedSplit, ~group_by(.x, group) %>% 
      nest() %>% 
      mutate(
        rolledData = map(data, ~.x %>% 
                           rolling_origin(
                             data = .,
                             initial = 60,
                             assess = 0,
                             cumulative = FALSE,
                             skip = 0
                           )
                         )
        )
)

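I am having difficulty unnesting the result.

Inspecting the structure of the list:

I can access the first split with map(nestedRolledData$df1_df2$rolledData[[1]]$splits, ~analysis(.x))[[1]], which gives me a data frame of 60 observations starting on 2020-03-01 and ending on 2020-04-29. For the second split in this list, map(nestedRolledData$df1_df2$rolledData[[1]]$splits, ~analysis(.x))[[2]], I get another data frame of 60 observations, this time starting on 2020-03-02 and ending on 2020-04-30 (so the data has shifted by one day). I can keep going up to [[63]], which starts on 2020-05-02 and ends on 2020-06-30 (the last day in my data).

This is what I want, i.e. the data has been correctly split as a time series for each grp. Now I want to unnest and put the pieces back into a proper data frame. Going back to level 3 of the list, it contains 6 lists, [[1]], ..., [[6]]. These correspond to each of the 3 groups of the 2 dfIDs. So I basically want to bind these lists together.

If I were to build these manually, it might look like: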
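One way to confirm that each window moves forward by one day is to tabulate the start and end date of every analysis set. This is a sketch on a hypothetical single-group tibble `oneGroup` with 10 days and `initial = 6`, `assess = 1`; the names and sizes are assumptions, not the question's data.

```r
library(dplyr)
library(purrr)
library(rsample)

# Hypothetical single-group series: one row per day
oneGroup <- tibble(
  date = seq.Date(as.Date("2020-03-01"), by = "day", length.out = 10),
  groupValues = seq_len(10)
)

rolled <- rolling_origin(oneGroup, initial = 6, assess = 1, cumulative = FALSE)

# One summary row per split: first date, last date and row count of the analysis set
windows <- map(rolled$splits, function(s) {
  a <- analysis(s)
  tibble(start = min(a$date), end = max(a$date), n = nrow(a))
}) %>%
  bind_rows()

print(windows)
```

With one row per day, each window covers exactly `initial` consecutive dates and successive rows of `windows` differ by one day.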

# grp 1:6 for rolling_origin split 1
map(nestedRolledData$df1_df2$rolledData[[1]]$splits, ~analysis(.x))[[1]]
map(nestedRolledData$df1_df2$rolledData[[2]]$splits, ~analysis(.x))[[1]]
map(nestedRolledData$df1_df2$rolledData[[3]]$splits, ~analysis(.x))[[1]]
map(nestedRolledData$df1_df2$rolledData[[4]]$splits, ~analysis(.x))[[1]]
map(nestedRolledData$df1_df2$rolledData[[5]]$splits, ~analysis(.x))[[1]]
map(nestedRolledData$df1_df2$rolledData[[6]]$splits, ~analysis(.x))[[1]]

# grp 1:6 for rolling_origin split 2
map(nestedRolledData$df1_df2$rolledData[[1]]$splits, ~analysis(.x))[[2]]
map(nestedRolledData$df1_df2$rolledData[[2]]$splits, ~analysis(.x))[[2]]
map(nestedRolledData$df1_df2$rolledData[[3]]$splits, ~analysis(.x))[[2]]
map(nestedRolledData$df1_df2$rolledData[[4]]$splits, ~analysis(.x))[[2]]
map(nestedRolledData$df1_df2$rolledData[[5]]$splits, ~analysis(.x))[[2]]
map(nestedRolledData$df1_df2$rolledData[[6]]$splits, ~analysis(.x))[[2]]

# ...

# grp 1:6 for rolling_origin split 63
map(nestedRolledData$df1_df2$rolledData[[1]]$splits, ~analysis(.x))[[63]]
map(nestedRolledData$df1_df2$rolledData[[2]]$splits, ~analysis(.x))[[63]]
map(nestedRolledData$df1_df2$rolledData[[3]]$splits, ~analysis(.x))[[63]]
map(nestedRolledData$df1_df2$rolledData[[4]]$splits, ~analysis(.x))[[63]]
map(nestedRolledData$df1_df2$rolledData[[5]]$splits, ~analysis(.x))[[63]]
map(nestedRolledData$df1_df2$rolledData[[6]]$splits, ~analysis(.x))[[63]]

Data:

library(tidyverse)  # dplyr, tidyr, tibble, purrr, stringr
library(rsample)
df1 = data.frame(
  date = seq.Date(from = as.Date("2020-03-01"), to = as.Date("2020-06-30"), by = "days"),
  grp1 = runif(122),
  grp2 = runif(122),
  grp3 = runif(122)
) %>% 
  pivot_longer(cols = where(is.numeric), names_to = "group", values_to = "groupValues") %>% 
  add_column(dfID = "df1")

df2 = data.frame(
  date = seq.Date(from = as.Date("2020-03-01"), to = as.Date("2020-06-30"), by = "days"),
  grp4 = runif(122),
  grp5 = runif(122),
  grp6 = runif(122)
) %>% 
  pivot_longer(cols = where(is.numeric), names_to = "group", values_to = "groupValues") %>% 
  add_column(dfID = "df2")

df3 = data.frame(
  date = seq.Date(from = as.Date("2020-03-01"), to = as.Date("2020-06-30"), by = "days"),
  grp7 = runif(122),
  grp8 = runif(122),
  grp9 = runif(122)
) %>% 
  pivot_longer(cols = where(is.numeric), names_to = "group", values_to = "groupValues") %>% 
  add_column(dfID = "df3")

df <- bind_rows(df1, df2, df3) %>% 
  relocate(dfID, .before = date)



We may need a three-level nested map:

library(purrr)
library(rsample)
out <- map(nestedRolledData, ~ map(.x$rolledData, ~ map(.x$splits, analysis)))

str(out, max.level = 3)
#List of 3
# $ df1_df2:List of 6
#  ..$ :List of 63
#  .. ..$ : tibble [60 × 3] (S3: tbl_df/tbl/data.frame)
#  .. ..$ : tibble [60 × 3] (S3: tbl_df/tbl/data.frame)
#  .. ..$ : tibble [60 × 3] (S3: tbl_df/tbl/data.frame)
#  .. ..$ : tibble [60 × 3] (S3: tbl_df/tbl/data.frame)
#  .. ..$ : tibble [60 × 3] (S3: tbl_df/tbl/data.frame)
# ..
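
To go one step further and bind the per-group frames for each split index back into a single data frame, one option is to flip the list from group-then-split to split-then-group and stack each element. This is a sketch, not a tested solution for the full data: `toy` is a made-up stand-in for one element of `combinedSplit`, and `initial = 6`, `assess = 1` are assumed values chosen for brevity. Because nest() drops the `group` column from `data`, the group names are re-attached through the list names.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(rsample)

# Hypothetical stand-in for one element of combinedSplit: 2 groups x 10 days
toy <- tibble(
  dfID = "df1",
  date = rep(seq.Date(as.Date("2020-03-01"), by = "day", length.out = 10), times = 2),
  group = rep(c("grp1", "grp2"), each = 10),
  groupValues = seq_len(20) / 20
)

# Same pattern as nestedRolledData, applied to the toy data
nested <- toy %>%
  group_by(group) %>%
  nest() %>%
  ungroup() %>%
  mutate(rolledData = map(data, ~ rolling_origin(.x, initial = 6, assess = 1,
                                                 cumulative = FALSE)))

# group -> split -> analysis data frame, with the outer list named by group
perGroup <- map(nested$rolledData, ~ map(.x$splits, analysis)) %>%
  set_names(nested$group)

# Flip to split -> group, then stack the groups of each split into one frame
bySplit <- perGroup %>%
  transpose() %>%
  map(~ bind_rows(.x, .id = "group"))

print(bySplit[[1]])
```

Applied to the question's data, the same two steps could be wrapped in an outer `map()` over `nestedRolledData`. Note that `transpose()` is superseded by `list_transpose()` in purrr 1.0 and later, though it remains available.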