R: Proper use of the across() function with na.locf()
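I am trying to replicate an earlier solution, but with the newer syntax that uses the across() function and drops the deprecated summarise_all() and funs().
Starting data
I have a database extract with one row per event type, like this: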
library(tidyverse)
library(zoo)
df_start <- tibble(
  shipment = c(rep("A", 4), rep("B", 4)),
  stop = rep(c(1, 1, 2, 2), 2),
  arrive_pickup = as.POSIXct(c("2021-01-01 07:00:00 UTC", NA, NA, NA, "2021-06-05 12:10:00 UTC", NA, NA, NA)),
  depart_pickup = as.POSIXct(c(NA, "2021-01-01 08:40:00 UTC", NA, NA, NA, "2021-06-05 16:58:00 UTC", NA, NA)),
  arrive_delivery = as.POSIXct(c(NA, NA, "2021-01-05 10:00:00 UTC", NA, NA, NA, "2021-06-08 10:58:00 UTC", NA)),
  depart_delivery = as.POSIXct(c(NA, NA, NA, "2021-01-05 11:30:00 UTC", NA, NA, NA, "2021-06-08 13:50:00 UTC"))
)
> df_start
# A tibble: 8 x 6
shipment stop arrive_pickup depart_pickup arrive_delivery depart_delivery
<chr> <dbl> <dttm> <dttm> <dttm> <dttm>
1 A 1 2021-01-01 07:00:00 NA NA NA
2 A 1 NA 2021-01-01 08:40:00 NA NA
3 A 2 NA NA 2021-01-05 10:00:00 NA
4 A 2 NA NA NA 2021-01-05 11:30:00
5 B 1 2021-06-05 12:10:00 NA NA NA
6 B 1 NA 2021-06-05 16:58:00 NA NA
7 B 2 NA NA 2021-06-08 10:58:00 NA
8 B 2 NA NA NA 2021-06-08 13:50:00
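Desired outcome
... I would like to collapse the rows by grouping by shipment and stop, or even by shipment alone (I am not sure whether keeping the NAs in the final data frame affects the answer, but I would like to be able to solve it either way).
df_finish1 # One desired result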
# A tibble: 4 x 6
shipment stop arrive_pickup depart_pickup arrive_delivery depart_delivery
<chr> <dbl> <dttm> <dttm> <dttm> <dttm>
1 A 1 2021-01-01 07:00:00 2021-01-01 08:40:00 NA NA
2 A 2 NA NA 2021-01-05 10:00:00 2021-01-05 11:30:00
3 B 1 2021-06-05 12:10:00 2021-06-05 16:58:00 NA NA
4 B 2 NA NA 2021-06-08 10:58:00 2021-06-08 13:50:00
df_finish2 # Second/alternative desired result
# A tibble: 2 x 5
shipment arrive_pickup depart_pickup arrive_delivery depart_delivery
<chr> <dttm> <dttm> <dttm> <dttm>
1 A 2021-01-01 07:00:00 2021-01-01 08:40:00 2021-01-05 10:00:00 2021-01-05 11:30:00
2 B 2021-06-05 12:10:00 2021-06-05 16:58:00 2021-06-08 10:58:00 2021-06-08 13:50:00
What I have researched and tried
Based on that earlier solution, this does work:
df_1 <- df_start %>%
  group_by(shipment, stop) %>% # Two groupings
  summarise_all(funs(na.locf(., na.rm = FALSE, fromLast = FALSE))) %>%
  filter(row_number() == n())
> df_1
# A tibble: 4 x 6
# Groups: shipment, stop [4]
shipment stop arrive_pickup depart_pickup arrive_delivery depart_delivery
<chr> <dbl> <dttm> <dttm> <dttm> <dttm>
1 A 1 2021-01-01 07:00:00 2021-01-01 08:40:00 NA NA
2 A 2 NA NA 2021-01-05 10:00:00 2021-01-05 11:30:00
3 B 1 2021-06-05 12:10:00 2021-06-05 16:58:00 NA NA
4 B 2 NA NA 2021-06-08 10:58:00 2021-06-08 13:50:00
df_2 <- df_start %>%
  group_by(shipment) %>% # Single grouping
  summarise_all(funs(na.locf(., na.rm = FALSE, fromLast = FALSE))) %>%
  filter(row_number() == n())
> df_2
# A tibble: 2 x 6
# Groups: shipment [2]
shipment stop arrive_pickup depart_pickup arrive_delivery depart_delivery
<chr> <dbl> <dttm> <dttm> <dttm> <dttm>
1 A 2 2021-01-01 07:00:00 2021-01-01 08:40:00 2021-01-05 10:00:00 2021-01-05 11:30:00
2 B 2 2021-06-05 12:10:00 2021-06-05 16:58:00 2021-06-08 10:58:00 2021-06-08 13:50:00
But I see that the summarise_all() and funs() functions are deprecated and will go away, so I tried to work out how to do the same thing with across(), without success:
df_3 <- df_start %>%
  group_by(shipment) %>%
  summarise(across(everything()), na.locf(., na.rm = FALSE, fromLast = FALSE))
> df_3 <- df_start %>%
+ group_by(shipment) %>%
+ summarise(across(everything()), na.locf(., na.rm = FALSE, fromLast = FALSE))
Error: Problem with `summarise()` input `..2`.
x Input `..2` must be size 4 or 1, not 8.
i An earlier column had size 4.
i Input `..2` is `na.locf(., na.rm = FALSE, fromLast = FALSE)`.
i The error occurred in group 1: shipment = "A".
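I have read through vignette("colwise"), which describes the differences and suggests that I simply swap in the syntax as shown above, but clearly I am not doing it right. Help?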
Here is one option: after grouping by 'shipment' and 'stop', reorder each column so that its non-NA values come first, then filter out the rows in which every column is NA.
library(dplyr)
df_start %>%
  group_by(shipment, stop) %>%
  mutate(across(everything(), ~ .[order(is.na(.))])) %>%
  filter(!if_all(everything(), is.na)) %>%
  ungroup()
# A tibble: 4 x 6
shipment stop arrive_pickup depart_pickup arrive_delivery depart_delivery
<chr> <dbl> <dttm> <dttm> <dttm> <dttm>
1 A 1 2021-01-01 07:00:00 2021-01-01 08:40:00 NA NA
2 A 2 NA NA 2021-01-05 10:00:00 2021-01-05 11:30:00
3 B 1 2021-06-05 12:10:00 2021-06-05 16:58:00 NA NA
4 B 2 NA NA 2021-06-08 10:58:00 2021-06-08 13:50:00
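The reordering idiom in that pipeline may be easier to see on a single vector. The following is only a small sketch on a made-up vector x (not part of the original data): is.na() yields FALSE for real values and TRUE for NA, and order() sorts FALSE before TRUE while leaving ties in their original order.

```r
# Toy vector, made up for illustration only
x <- c(NA, 3, NA, 1)

# Positions of the non-NA values come first, in their original order,
# followed by the positions of the NA values.
idx <- order(is.na(x))

# Indexing with idx moves the real values to the front of the vector.
x[idx]
```

Inside mutate(across(...)) the same reordering is applied to each column within each group, which is why the all-NA rows end up at the bottom of each group and can then be filtered out.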
For the second case, use across():
df_start %>%
  group_by(shipment) %>%
  dplyr::summarise(across(contains("_"), ~ na.omit(.)))
# A tibble: 2 x 5
shipment arrive_pickup depart_pickup arrive_delivery depart_delivery
<chr> <dttm> <dttm> <dttm> <dttm>
1 A 2021-01-01 07:00:00 2021-01-01 08:40:00 2021-01-05 10:00:00 2021-01-05 11:30:00
2 B 2021-06-05 12:10:00 2021-06-05 16:58:00 2021-06-08 10:58:00 2021-06-08 13:50:00
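Note that the na.omit() version gives one row per group here because every column has exactly one non-NA value per shipment. As a hedged alternative, not taken from the answer itself, the sketch below picks the first non-NA value explicitly, so a column with no value in a group simply yields NA. The data frame toy and the helper first_non_na() are hypothetical names, and plain numbers stand in for the timestamps.

```r
library(dplyr)

# Toy data, made up for illustration: two rows per shipment
toy <- tibble(
  shipment = c("A", "A", "B", "B"),
  arrive   = c(1, NA, 3, NA),
  depart   = c(NA, 2, NA, 4)
)

# Hypothetical helper: first non-NA element, or NA when there is none
first_non_na <- function(x) x[!is.na(x)][1]

res <- toy %>%
  group_by(shipment) %>%
  summarise(across(everything(), first_non_na), .groups = "drop")

res
```

Because the helper always returns a single value, the same call should also work when grouping by both shipment and stop, where some columns are entirely NA within a group.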
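In the OP's code, na.locf is used instead of na.omit, and there is also a typo: across() is closed before it is given a function. Because of that, na.locf(., ...) becomes a second argument to summarise(), where . is the whole piped data frame (8 rows) rather than one column of the current group (4 rows), which is what the size error reports. If we check the linked post, the syntax used is
...across(everything(), ~ .. # correct
...across(everything()) ... # incorrect
So we just need to move the ) to the end and add ~ to specify a lambda function (or else spell it out with function(.)):
df_start %>%
  group_by(shipment) %>%
  summarise(across(everything(), ~ na.locf(., na.rm = FALSE, fromLast = FALSE)))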
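One thing to keep in mind: na.locf() returns a filled vector of the same length as its input, so the corrected call still produces one row per original row, and the filter(row_number() == n()) step from the question is still needed to collapse each group. As far as I know, recent dplyr releases (1.1.0 and later) also warn when summarise() returns more than one row per group. A minimal sketch of one way around both points, on made-up data (toy is a hypothetical name), is to keep only the last filled value inside the lambda:

```r
library(dplyr)
library(zoo)

# Toy data, made up for illustration: two rows per shipment
toy <- tibble(
  shipment = c("A", "A", "B", "B"),
  arrive   = c(1, NA, 3, NA),
  depart   = c(NA, 2, NA, 4)
)

res <- toy %>%
  group_by(shipment) %>%
  summarise(
    # Carry the last observation forward, then keep only the final element,
    # so each column contributes exactly one value per group.
    across(everything(), ~ last(na.locf(.x, na.rm = FALSE))),
    .groups = "drop"
  )

res
```

This is just one option; keeping the original filter() step after the corrected summarise() gives the same rows on the question's data.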
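There are a couple of syntax issues in the code.
1 - The arguments .cols and .fns go inside across(); in your code the across() call is closed right after everything() (across(everything())).
2 - When you use . inside across(), you need to put ~ in front of the expression to indicate that you are passing a lambda expression as the function. (See the .fns argument in ?across.)
Incorporating these changes, you can use -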
library(dplyr)
library(zoo)
df_start %>%
  group_by(shipment) %>%
  summarise(across(everything(), ~na.locf(., na.rm = FALSE, fromLast = FALSE)))
However, across() has everything() as the default for its .cols argument, and you can also apply the function without ~, so another way to write it is -
df_start %>%
  group_by(shipment) %>%
  summarise(across(.fns = na.locf, na.rm = FALSE, fromLast = FALSE))
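A caveat on this last form, as far as I am aware: dplyr 1.1.0 and later soft-deprecate both omitting .cols and passing extra arguments through the ... of across(), so it may produce warnings on newer installations. The sketch below, on made-up data (toy is a hypothetical name), shows two spellings that keep the extra arguments inside the function itself. It uses mutate() so that the full filled column is visible.

```r
library(dplyr)
library(zoo)

# Toy data, made up for illustration
toy <- tibble(
  shipment = c("A", "A", "B", "B"),
  arrive   = c(1, NA, 3, NA)
)

# Formula shorthand, as in the answers above
filled_formula <- toy %>%
  group_by(shipment) %>%
  mutate(across(everything(), ~ na.locf(.x, na.rm = FALSE))) %>%
  ungroup()

# The same thing written as an explicit anonymous function
filled_function <- toy %>%
  group_by(shipment) %>%
  mutate(across(everything(), function(x) na.locf(x, na.rm = FALSE))) %>%
  ungroup()

filled_formula$arrive
filled_function$arrive
```

Both spellings fill the NA in each shipment with the value above it; which one to use is a matter of taste.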