do() is superseded! The alternative is to use across(), nest_by(), and summarise(). How?

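I am doing something quite simple. Given a data frame of start dates and end dates for specific periods, I want to expand/create the full weekly sequence for each period (with the factor on each row), then output the result as one large data frame.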

For instance:

library(tidyverse)
library(lubridate)

# Dataset
  start_dates = ymd_hms(c("2019-05-08 00:00:00",
                          "2020-01-17 00:00:00",
                          "2020-03-03 00:00:00",
                          "2020-05-28 00:00:00",
                          "2020-12-10 00:00:00",
                          "2021-05-07 00:00:00",
                          "2022-01-04 00:00:00"), tz = "UTC")
  
  end_dates = ymd_hms(c( "2019-10-24 00:00:00",
                         "2020-03-03 00:00:00", 
                         "2020-05-28 00:00:00",
                         "2020-12-10 00:00:00",
                         "2021-05-07 00:00:00",
                         "2022-01-04 00:00:00",
                         "2022-01-19 00:00:00"), tz = "UTC") 
  
  df1 = data.frame(studying = paste0("period", 1:7), start_dates, end_dates)

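Someone suggested I use do(), which currently works fine, but I dislike that it is superseded. I also have a way of doing it with map2. However, the documentation (https://dplyr.tidyverse.org/reference/do.html) suggests that nest_by(), across(), and summarise() can do the same job as do(). How would I go about getting the same result? I have tried a lot of things, but I can't seem to get it.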

# do() way to do it
df1 %>% 
  group_by(studying) %>% 
  do(data.frame(week=seq(.$start_dates,.$end_dates,by="1 week")))
# map2() way to do it (inside transmute())
df1 %>% 
  transmute(weeks = map2(start_dates, end_dates, seq, by = "1 week"), studying) %>% 
  unnest(cols = c(weeks))

Not sure if this is exactly what you are looking for, but here is my attempt with rowwise() and unnest():

df1 %>% 
  rowwise() %>% 
  mutate(week = list(seq(start_dates, end_dates, by = "1 week"))) %>% 
  select(studying, week) %>% 
  unnest(cols = c(week))

You can also use tidyr::complete:

df1 %>% 
  group_by(studying) %>% 
  complete(start_dates = seq(from = start_dates, to = end_dates, by = "1 week")) %>% 
  select(-end_dates, weeks = start_dates)

# A tibble: 134 x 2
# Groups:   studying [7]
   studying weeks              
   <chr>    <dttm>             
 1 period1  2019-05-08 00:00:00
 2 period1  2019-05-15 00:00:00
 3 period1  2019-05-22 00:00:00
 4 period1  2019-05-29 00:00:00
 5 period1  2019-06-05 00:00:00
 6 period1  2019-06-12 00:00:00
 7 period1  2019-06-19 00:00:00
 8 period1  2019-06-26 00:00:00
 9 period1  2019-07-03 00:00:00
10 period1  2019-07-10 00:00:00
# ... with 124 more rows
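
As a quick sanity check on the 134 rows above, the expected row count per period can be worked out from the dates alone: a weekly sequence from start to end has floor(days / 7) + 1 elements. This is a minimal base R sketch; the names span_days and expected_n are made up for illustration, and the dates are copied from the question.

```r
# Start and end dates copied from the question (UTC, so no DST effects)
start_dates <- as.POSIXct(c("2019-05-08", "2020-01-17", "2020-03-03",
                            "2020-05-28", "2020-12-10", "2021-05-07",
                            "2022-01-04"), tz = "UTC")
end_dates <- as.POSIXct(c("2019-10-24", "2020-03-03", "2020-05-28",
                          "2020-12-10", "2021-05-07", "2022-01-04",
                          "2022-01-19"), tz = "UTC")

# Length of each period in days
span_days <- as.numeric(difftime(end_dates, start_dates, units = "days"))

# A weekly sequence includes the start date plus one element per full week
expected_n <- floor(span_days / 7) + 1

expected_n       # 25 7 13 29 22 35 3
sum(expected_n)  # 134, matching the tibble above
```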

As the documentation in ?do suggests, we can now use summarise() and replace . with across():

library(tidyverse)
library(lubridate)

df1 %>% 
  group_by(studying) %>% 
  summarise(week = seq(across()$start_dates,
                       across()$end_dates,
                       by = "1 week"))
#> `summarise()` has grouped output by 'studying'. You can override using the
#> `.groups` argument.
#> # A tibble: 134 x 2
#> # Groups:   studying [7]
#>    studying week               
#>    <chr>    <dttm>             
#>  1 period1  2019-05-08 00:00:00
#>  2 period1  2019-05-15 00:00:00
#>  3 period1  2019-05-22 00:00:00
#>  4 period1  2019-05-29 00:00:00
#>  5 period1  2019-06-05 00:00:00
#>  6 period1  2019-06-12 00:00:00
#>  7 period1  2019-06-19 00:00:00
#>  8 period1  2019-06-26 00:00:00
#>  9 period1  2019-07-03 00:00:00
#> 10 period1  2019-07-10 00:00:00
#> # … with 124 more rows

Created on 2022-01-19 by the reprex package (v0.3.0)
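
A note for newer dplyr versions: from dplyr 1.1.0 onward, returning more than one row per group from summarise() and calling across() without .cols are both deprecated, and reframe() and pick() are the suggested replacements. The sketch below assumes dplyr >= 1.1.0 and uses a hypothetical two-period subset, df_small, rather than the full df1 from the question.

```r
library(dplyr)

# Hypothetical two-period subset of the question's data
df_small <- data.frame(
  studying    = paste0("period", 1:2),
  start_dates = as.POSIXct(c("2019-05-08", "2020-01-17"), tz = "UTC"),
  end_dates   = as.POSIXct(c("2019-10-24", "2020-03-03"), tz = "UTC")
)

# reframe() may return any number of rows per group and drops the grouping
res <- df_small %>%
  group_by(studying) %>%
  reframe(week = seq(start_dates, end_dates, by = "1 week"))

# pick(everything()) plays the role that `.` had in do():
# it returns the current group's non-grouping columns as a data frame
res_pick <- df_small %>%
  group_by(studying) %>%
  reframe({
    d <- pick(everything())
    tibble(week = seq(d$start_dates, d$end_dates, by = "1 week"))
  })
```

Both results should contain the columns studying and week. Referring to the columns directly, as in the first call, is usually enough; pick() is mainly useful when a function needs the whole group as a data frame.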

Although it is marked as experimental, the help file for group_modify does say

‘group_modify()’ is an evolution of ‘do()’

In fact, the example code from the question, rewritten with group_modify, is almost identical to the do version.

# with group_modify
df2 <- df1 %>% 
  group_by(studying) %>% 
  group_modify(~ data.frame(week = seq(.$start_dates, .$end_dates, by = "1 week")))

# with do
df0 <- df1 %>% 
  group_by(studying) %>% 
  do(data.frame(week = seq(.$start_dates, .$end_dates, by = "1 week")))

identical(df2, df0)
## [1] TRUE

Another approach:

library(tidyverse)

df1 %>%
    group_by(studying) %>%
    summarise(df = tibble(weeks = seq(start_dates, end_dates, by = 'week'))) %>%
    unnest(df)
#> `summarise()` has grouped output by 'studying'. You can override using the `.groups` argument.
#> # A tibble: 134 × 2
#> # Groups:   studying [7]
#>    studying weeks              
#>    <chr>    <dttm>             
#>  1 period1  2019-05-08 00:00:00
#>  2 period1  2019-05-15 00:00:00
#>  3 period1  2019-05-22 00:00:00
#>  4 period1  2019-05-29 00:00:00
#>  5 period1  2019-06-05 00:00:00
#>  6 period1  2019-06-12 00:00:00
#>  7 period1  2019-06-19 00:00:00
#>  8 period1  2019-06-26 00:00:00
#>  9 period1  2019-07-03 00:00:00
#> 10 period1  2019-07-10 00:00:00
#> # … with 124 more rows

Created on 2022-01-20 by the reprex package (v2.0.1)
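
Since the question asks about nest_by() specifically, here is one way it might look. nest_by() returns a row-wise data frame with a list-column named data, which can be used where do() used `.`. This is only a sketch on a hypothetical two-period subset (df_small); it assumes dplyr >= 1.0.0 and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-period subset of the question's data
df_small <- data.frame(
  studying    = paste0("period", 1:2),
  start_dates = as.POSIXct(c("2019-05-08", "2020-01-17"), tz = "UTC"),
  end_dates   = as.POSIXct(c("2019-10-24", "2020-03-03"), tz = "UTC")
)

res_nest <- df_small %>%
  nest_by(studying) %>%                 # one row per period, rest in `data`
  mutate(week = list(seq(data$start_dates, data$end_dates, by = "1 week"))) %>%
  ungroup() %>%                         # drop the row-wise grouping
  select(studying, week) %>%
  unnest(cols = c(week))                # one row per week
```

The sequence is wrapped in list() because each row-wise step must return a single element per row; unnest() then expands it to the long format produced by the other answers.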