Create multiple ranges of dates in R with the tidyverse
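I am trying to find a way to separate out the various intervals in my data so that each row associated with an ID has the minimum and maximum date of each interval, where intervals are separated by months that are NA.
My data looks like this, but with 9 columns and 275 rows: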
df <- data.frame("ID" = c(1:5),
"jan" = c("2020-01-01",NA, "2020-01-01", "2020-01-01", "2020-01-01"),
"feb" = c("2020-02-01", "2020-02-01", NA, "2020-02-01", "2020-02-01"),
"mar" = c("2020-03-01", "2020-03-01", NA, "2020-03-01", NA),
"apr" = c(NA, "2020-04-01", NA, "2020-04-01", "2020-04-01"),
"may" = c("2020-05-01", "2020-05-01", NA ,NA, "2020-05-01"),
"jun" = c("2020-06-01", "2020-06-01", "2020-06-01", NA, NA)
)
Ideally, the columns would look like this:
ID Start1 Stop1 Start2 Stop2
1 "2020-01-01" "2020-03-01" "2020-05-01" "2020-06-01"
....
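Edit: I have edited this question because A) it was marked as a duplicate, although the duplicate question is only tangentially related, and B) I am really looking for a tidyverse solution; that is what I've got.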
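You can get the data in long format and start a new start/stop group at every NA value. For each group, select the first and last date values with first() and last(), then get the data back in wide format.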
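library(dplyr)
library(tidyr)  # provides pivot_longer() and pivot_wider()

df %>%
  # Long format: one row per ID and month
  pivot_longer(cols = -ID) %>%
  # The running NA count increases at every gap, so each run gets its own group
  group_by(ID, grp = cumsum(is.na(value))) %>%
  na.omit() %>%
  # First and last date of each run
  summarise(start = first(value),
            stop = last(value)) %>%
  # Renumber the runs 1, 2, ... within each ID
  mutate(grp = row_number()) %>%
  pivot_wider(names_from = grp, values_from = c(start, stop)) %>%
  # Reorder the columns to start_1, stop_1, start_2, stop_2, ...
  select(ID, order(readr::parse_number(names(.))))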
# ID start_1 stop_1 start_2 stop_2
# <int> <chr> <chr> <chr> <chr>
#1 1 2020-01-01 2020-03-01 2020-05-01 2020-06-01
#2 2 2020-02-01 2020-06-01 NA NA
#3 3 2020-01-01 2020-01-01 2020-06-01 2020-06-01
#4 4 2020-01-01 2020-04-01 NA NA
#5 5 2020-01-01 2020-02-01 2020-04-01 2020-05-01
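To see why grouping on cumsum(is.na(value)) splits the months into runs, here is a minimal base R sketch for a single ID. It is an illustration only, not part of the answer above: the vector is the ID 1 row copied by hand from the question's data, and the names keep, runs and intervals are made up for this example.

```r
# Values for ID 1 in month order (jan..jun), copied from the question's data.
value <- c("2020-01-01", "2020-02-01", "2020-03-01", NA, "2020-05-01", "2020-06-01")

# The running count of NAs goes up by one at every gap, so each run of
# consecutive non-NA months shares one group id.
grp <- cumsum(is.na(value))
grp
#> [1] 0 0 0 1 1 1

# Drop the NA positions, then take the first and last date of each group.
keep <- !is.na(value)
runs <- split(value[keep], grp[keep])
intervals <- data.frame(
  start = unname(vapply(runs, function(x) x[1], character(1))),
  stop  = unname(vapply(runs, function(x) x[length(x)], character(1)))
)
intervals
```

This yields two rows, 2020-01-01 to 2020-03-01 and 2020-05-01 to 2020-06-01, which match start_1/stop_1 and start_2/stop_2 for ID 1 in the output above. The NA itself is counted into the second group, which is why it has to be dropped before taking the first value.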