R: Proper use of across() with na.locf()

I am trying to reproduce an existing solution using the newer syntax built on across(), getting rid of the deprecated summarise_all() and funs().

Starting data

I have a database extract with one row per event type, as shown below:

library(tidyverse)
library(zoo)

df_start <- tibble(shipment = c(rep("A",4), rep("B",4)), 
             stop = rep(c(1,1,2,2), 2),
             arrive_pickup = as.POSIXct(c("2021-01-01 07:00:00 UTC",NA, NA, NA,"2021-06-05 12:10:00 UTC", NA, NA, NA)),
             depart_pickup = as.POSIXct(c(NA,"2021-01-01 08:40:00 UTC", NA, NA, NA, "2021-06-05 16:58:00 UTC", NA, NA)),
             arrive_delivery = as.POSIXct(c(NA, NA, "2021-01-05 10:00:00 UTC",NA, NA, NA,"2021-06-08 10:58:00 UTC", NA)),
             depart_delivery = as.POSIXct(c(NA, NA, NA, "2021-01-05 11:30:00 UTC",NA, NA, NA,"2021-06-08 13:50:00 UTC"))
)

> df_start
# A tibble: 8 x 6
  shipment  stop arrive_pickup       depart_pickup       arrive_delivery     depart_delivery    
  <chr>    <dbl> <dttm>              <dttm>              <dttm>              <dttm>             
1 A            1 2021-01-01 07:00:00 NA                  NA                  NA                 
2 A            1 NA                  2021-01-01 08:40:00 NA                  NA                 
3 A            2 NA                  NA                  2021-01-05 10:00:00 NA                 
4 A            2 NA                  NA                  NA                  2021-01-05 11:30:00
5 B            1 2021-06-05 12:10:00 NA                  NA                  NA                 
6 B            1 NA                  2021-06-05 16:58:00 NA                  NA                 
7 B            2 NA                  NA                  2021-06-08 10:58:00 NA                 
8 B            2 NA                  NA                  NA                  2021-06-08 13:50:00

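Desired result

... I would like to collapse the rows by grouping on shipment and stop, or even on shipment alone. (I am not sure whether keeping NAs in the final data frame affects the answer, but I would like to be able to solve it either way.)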

df_finish1 # One desired result

# A tibble: 4 x 6
  shipment  stop arrive_pickup       depart_pickup       arrive_delivery     depart_delivery    
  <chr>    <dbl> <dttm>              <dttm>              <dttm>              <dttm>             
1 A            1 2021-01-01 07:00:00 2021-01-01 08:40:00 NA                  NA                 
2 A            2 NA                  NA                  2021-01-05 10:00:00 2021-01-05 11:30:00
3 B            1 2021-06-05 12:10:00 2021-06-05 16:58:00 NA                  NA                 
4 B            2 NA                  NA                  2021-06-08 10:58:00 2021-06-08 13:50:00

df_finish2 # Second/alternative desired result

# A tibble: 2 x 5
  shipment arrive_pickup       depart_pickup       arrive_delivery     depart_delivery    
  <chr>    <dttm>              <dttm>              <dttm>              <dttm>             
1 A        2021-01-01 07:00:00 2021-01-01 08:40:00 2021-01-05 10:00:00 2021-01-05 11:30:00
2 B        2021-06-05 12:10:00 2021-06-05 16:58:00 2021-06-08 10:58:00 2021-06-08 13:50:00

What I have researched and tried

Based on that approach, the following does work:

df_1 <- df_start %>% 
  group_by(shipment, stop) %>%   # Two groupings
  summarise_all(funs(na.locf(., na.rm = FALSE, fromLast = FALSE))) %>% 
  filter(row_number()==n())
  
> df_1
# A tibble: 4 x 6
# Groups:   shipment, stop [4]
  shipment  stop arrive_pickup       depart_pickup       arrive_delivery     depart_delivery    
  <chr>    <dbl> <dttm>              <dttm>              <dttm>              <dttm>             
1 A            1 2021-01-01 07:00:00 2021-01-01 08:40:00 NA                  NA                 
2 A            2 NA                  NA                  2021-01-05 10:00:00 2021-01-05 11:30:00
3 B            1 2021-06-05 12:10:00 2021-06-05 16:58:00 NA                  NA                 
4 B            2 NA                  NA                  2021-06-08 10:58:00 2021-06-08 13:50:00

df_2 <- df_start %>% 
  group_by(shipment) %>%   # Single grouping
  summarise_all(funs(na.locf(., na.rm = FALSE, fromLast = FALSE))) %>% 
  filter(row_number()==n())

> df_2
# A tibble: 2 x 6
# Groups:   shipment [2]
  shipment  stop arrive_pickup       depart_pickup       arrive_delivery     depart_delivery    
  <chr>    <dbl> <dttm>              <dttm>              <dttm>              <dttm>             
1 A            2 2021-01-01 07:00:00 2021-01-01 08:40:00 2021-01-05 10:00:00 2021-01-05 11:30:00
2 B            2 2021-06-05 12:10:00 2021-06-05 16:58:00 2021-06-08 10:58:00 2021-06-08 13:50:00

However, I see that summarise_all() and funs() are deprecated and should no longer be used, so I tried to work out how to get the same result with across(), without success:

df_3 <- df_start %>% 
  group_by(shipment) %>% 
  summarise(across(everything()), na.locf(., na.rm = FALSE, fromLast = FALSE))

> df_3 <- df_start %>% 
+   group_by(shipment) %>% 
+   summarise(across(everything()), na.locf(., na.rm = FALSE, fromLast = FALSE))
Error: Problem with `summarise()` input `..2`.
x Input `..2` must be size 4 or 1, not 8.
i An earlier column had size 4.
i Input `..2` is `na.locf(., na.rm = FALSE, fromLast = FALSE)`.
i The error occurred in group 1: shipment = "A".

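I have read through vignette("colwise"), which describes the differences and suggests that I only need to swap in the syntax shown above, but clearly I am not doing it right. Help?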

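Here is one option: after grouping by 'shipment' and 'stop', reorder each column so that its non-NA values come first, then filter out the rows in which every column is NA.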

library(dplyr)
df_start %>%
     group_by(shipment, stop) %>% 
     mutate(across(everything(), ~ .[order(is.na(.))])) %>% 
     filter(!if_all(everything(), is.na)) %>% 
     ungroup
# A tibble: 4 x 6
  shipment  stop arrive_pickup       depart_pickup       arrive_delivery     depart_delivery    
  <chr>    <dbl> <dttm>              <dttm>              <dttm>              <dttm>             
1 A            1 2021-01-01 07:00:00 2021-01-01 08:40:00 NA                  NA                 
2 A            2 NA                  NA                  2021-01-05 10:00:00 2021-01-05 11:30:00
3 B            1 2021-06-05 12:10:00 2021-06-05 16:58:00 NA                  NA                 
4 B            2 NA                  NA                  2021-06-08 10:58:00 2021-06-08 13:50:00
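
To make the reordering step easier to follow, here is a minimal sketch on a plain vector. The vector x is made up for illustration; is.na() gives FALSE for real values and TRUE for missing ones, and order() sorts FALSE before TRUE while leaving ties in their original order, so the non-NA values move to the front.

```r
# Toy vector, invented for this illustration only
x <- c(NA, 5, NA, 7)

# order(is.na(x)) puts positions of non-NA values first, keeping their relative order
reordered <- x[order(is.na(x))]
print(reordered)
```

Applied per group and per column, this pushes the all-NA rows to the bottom of each group, which is why the following filter() can drop them.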

For the second case, using across():

df_start %>% 
   group_by(shipment) %>% 
   dplyr::summarise(across(contains("_"), ~ na.omit(.)))
# A tibble: 2 x 5
  shipment arrive_pickup       depart_pickup       arrive_delivery     depart_delivery    
  <chr>    <dttm>              <dttm>              <dttm>              <dttm>             
1 A        2021-01-01 07:00:00 2021-01-01 08:40:00 2021-01-05 10:00:00 2021-01-05 11:30:00
2 B        2021-06-05 12:10:00 2021-06-05 16:58:00 2021-06-08 10:58:00 2021-06-08 13:50:00
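
Note that na.omit() gives one value per group here only because each column has exactly one non-NA entry per shipment. If some groups may have no value at all (as in the first desired result), a small helper that returns the first non-missing value or NA could cover both groupings. This is only a sketch: first_non_na is a hypothetical helper, not part of dplyr or zoo, and the toy data uses plain numbers in place of the timestamps.

```r
library(dplyr)

# Hypothetical helper: first non-missing value, or NA when the group has none
first_non_na <- function(x) x[!is.na(x)][1]

# Toy data, invented for this illustration only
toy <- tibble(
  shipment        = c("A", "A", "A", "A"),
  stop            = c(1, 1, 2, 2),
  arrive_pickup   = c(10, NA, NA, NA),
  depart_pickup   = c(NA, 20, NA, NA),
  arrive_delivery = c(NA, NA, 30, NA)
)

# One row per shipment and stop; columns with no value in a group stay NA
by_stop <- toy %>%
  group_by(shipment, stop) %>%
  summarise(across(everything(), first_non_na), .groups = "drop")

# One row per shipment; stop is left out of the summarised columns
by_shipment <- toy %>%
  group_by(shipment) %>%
  summarise(across(-stop, first_non_na))
```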

The OP uses na.locf rather than na.omit, and the code also has a typo: across() is closed before any function is passed to it. Compare the two forms:

...across(everything(), ~ .. # correct
...across(everything()) ... # incorrect 

So we only need to move the closing ) to the end and add ~ to mark the lambda function (or else write function(.) explicitly).

df_start %>% 
  group_by(shipment) %>% 
  summarise(across(everything(), ~ na.locf(., na.rm = FALSE, fromLast = FALSE)))
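
Keep in mind that na.locf() with na.rm = FALSE returns one value per input row, so the fill still has to be followed by a step that keeps the last row of each group, as the OP did with filter(row_number() == n()). A minimal sketch of that two-step idea, assuming dplyr 1.0.0 or later for slice_tail() and using invented toy data with numbers in place of the timestamps:

```r
library(dplyr)
library(zoo)

# Toy data, invented for this illustration only
toy <- tibble(
  shipment = c("A", "A", "B", "B"),
  arrive   = c(1, NA, 3, NA),
  depart   = c(NA, 2, NA, 4)
)

filled <- toy %>%
  group_by(shipment) %>%
  # carry the last observed value forward within each group
  mutate(across(everything(), ~ na.locf(.x, na.rm = FALSE))) %>%
  # the last row of each group now holds every filled value
  slice_tail(n = 1) %>%
  ungroup()
```

Using mutate() for the fill keeps the row count unchanged, so the result does not depend on summarise() accepting more than one row per group.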

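There are a couple of syntax issues in the code.

1 - The arguments .cols and .fns both go inside across(); in your code the across() call is closed right after everything() (across(everything())).

2 - When you use . inside across(), you need to put ~ in front of the expression to indicate that you are passing a lambda expression as the function. (See the .fns argument in ?across.)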
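
As a quick illustration of the second point, the sketch below passes the same function to across() in three ways: as a ~ lambda, as an ordinary anonymous function, and as a bare function name. The data and the use of sum() are made up for the example; all three calls should give the same summary.

```r
library(dplyr)

# Toy data, invented for this illustration only
toy <- tibble(g = c("a", "a", "b"), x = c(1, 2, 3), y = c(10, 20, 30))

# ~ marks a lambda; .x stands for the current column
with_formula <- toy %>%
  group_by(g) %>%
  summarise(across(everything(), ~ sum(.x)))

# The same thing written as a regular anonymous function
with_function <- toy %>%
  group_by(g) %>%
  summarise(across(everything(), function(col) sum(col)))

# A bare function name needs no ~ because nothing refers to . or .x
with_name <- toy %>%
  group_by(g) %>%
  summarise(across(everything(), sum))
```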

Incorporating these changes, you can use -

library(dplyr)
library(zoo)

df_start %>% 
  group_by(shipment) %>% 
  summarise(across(everything(), ~na.locf(., na.rm = FALSE, fromLast = FALSE)))

However, across() has everything() as its default .cols argument, and a function can also be supplied without ~, so another way to write it is -

df_start %>% 
  group_by(shipment) %>% 
  summarise(across(.fns = na.locf, na.rm = FALSE, fromLast = FALSE))
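
One caveat, as far as I know: in newer dplyr releases (1.1.0 and later, I believe), omitting .cols and forwarding extra arguments through ... are both flagged as deprecated, so it may be safer to spell everything out inside an anonymous function. The sketch below does that on invented toy data and also wraps the fill in last(), so that each group collapses to a single value per column; that last() step is my addition, not part of the answer above.

```r
library(dplyr)
library(zoo)

# Toy data, invented for this illustration only
toy <- tibble(
  shipment = c("A", "A", "B", "B"),
  arrive   = c(1, NA, 3, NA),
  depart   = c(NA, 2, NA, 4)
)

collapsed <- toy %>%
  group_by(shipment) %>%
  summarise(across(
    everything(),
    # arguments to na.locf() are written inside the function, not passed via ...
    function(col) last(na.locf(col, na.rm = FALSE))
  ))
```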