标记连续观察并创建注册跨度
Flagging continuous observations and creating enrolment spans
我有几个大型注册数据集,我正在尝试创建两件事:
- 我想标记每个不间断的每月观察 (
final_df1
)
- 我想创建一个不间断跨度的数据集 (
final_df2
)
例如:
library(tidyverse)
library(lubridate)
library(magrittr)
df<-tibble(id=c(rep("X",10),rep("Y",20)),
date=c(ymd("20120101")%m+%months(c(1:5,7:11)),ymd("20120401")%m+%months(c(1:10,12:17,19:22))))
final_df1 <- df %>% mutate(cont_enroll=c(rep(1,5),rep(0,5),rep(1,10),rep(0,10)))
final_df2 <- tibble(id=c(rep("X",2),rep("Y",3)),
span_start=c(ymd("20120101")%m+%months(1),
ymd("20120101")%m+%months(7),
ymd("20120401")%m+%months(1),
ymd("20120101")%m+%months(12),
ymd("20120101")%m+%months(19)),
span_end=c(ymd("20120101")%m+%months(5),
ymd("20120101")%m+%months(11),
ymd("20120101")%m+%months(10),
ymd("20120101")%m+%months(17),
ymd("20120101")%m+%months(22))
)
我觉得在 {lubridate} 和 {data.table} 之间一定有一种简单的方法可以做到这一点,但我正在画空白。有什么建议吗?
按 'id' 分组,用之前的值 'date' (lag
) 和当前值 'date' 创建一个 interval
,除以 months
,检查是否小于2,取累计最小值(cummin
)。创建 'find_df_new' 后,我们按 'id' 和 'cont_enroll' 列的 run-length-id 分组,summarise
与 first
和 last
'date' 的值分别创建 'span_start' 和 'span_end'
library(dplyr)
library(lubridate)
library(data.table)
final_df_new <- df %>%
group_by(id) %>%
mutate(cont_enroll2 = cummin(interval(lag(date, default = first(date)),
date) /months(1) < 2)) %>%
ungroup
final_df_new %>%
group_by(id, grp = rleid(cont_enroll2)) %>%
summarise(span_start = first(date), span_end = last(date), .groups = 'drop')
我认为您可以使用 ivs 包很好地解决这个问题。您的日期似乎确实代表了 1 个月的间隔,而 ivs 包专门用于处理此类数据。
我们可以用 iv_groups()
计算 final_df2
,其中 returns 合并所有重叠区间后剩余的 non-overlapping 个区间。
那么每组的第一行final_df2
代表第一个连续的区间,所以只需要检查每个区间是否在这个区间内,判断是否属于不间断集合即可得到final_df1
.
请注意,我的 final_df2
看起来与您的不同,您的编码方式是否有误?
library(dplyr)
library(lubridate)
library(ivs)
df <- tibble(
id = c(
rep("X", 10),
rep("Y", 20)
),
date = c(
ymd("20120101") %m+% months(c(1:5,7:11)),
ymd("20120401") %m+% months(c(1:10,12:17,19:22))
)
)
df
#> # A tibble: 30 × 2
#> id date
#> <chr> <date>
#> 1 X 2012-02-01
#> 2 X 2012-03-01
#> 3 X 2012-04-01
#> 4 X 2012-05-01
#> 5 X 2012-06-01
#> 6 X 2012-08-01
#> 7 X 2012-09-01
#> 8 X 2012-10-01
#> 9 X 2012-11-01
#> 10 X 2012-12-01
#> # … with 20 more rows
df <- df %>%
mutate(start = date, end = date + months(1), .keep = "unused") %>%
mutate(range = iv(start, end), .keep = "unused")
df
#> # A tibble: 30 × 2
#> id range
#> <chr> <iv<date>>
#> 1 X [2012-02-01, 2012-03-01)
#> 2 X [2012-03-01, 2012-04-01)
#> 3 X [2012-04-01, 2012-05-01)
#> 4 X [2012-05-01, 2012-06-01)
#> 5 X [2012-06-01, 2012-07-01)
#> 6 X [2012-08-01, 2012-09-01)
#> 7 X [2012-09-01, 2012-10-01)
#> 8 X [2012-10-01, 2012-11-01)
#> 9 X [2012-11-01, 2012-12-01)
#> 10 X [2012-12-01, 2013-01-01)
#> # … with 20 more rows
# `iv_groups()` returns the groups that remain after merging all overlapping ranges.
# It gives you `final_df2`.
continuous <- df %>%
group_by(id) %>%
summarise(range = iv_groups(range), .groups = "drop")
continuous
#> # A tibble: 5 × 2
#> id range
#> <chr> <iv<date>>
#> 1 X [2012-02-01, 2012-07-01)
#> 2 X [2012-08-01, 2013-01-01)
#> 3 Y [2012-05-01, 2013-03-01)
#> 4 Y [2013-04-01, 2013-10-01)
#> 5 Y [2013-11-01, 2014-03-01)
# The first continuous range per id
first_continuous <- continuous %>%
group_by(id) %>%
slice(1) %>%
ungroup() %>%
rename(range_continuous = range)
first_continuous
#> # A tibble: 2 × 2
#> id range_continuous
#> <chr> <iv<date>>
#> 1 X [2012-02-01, 2012-07-01)
#> 2 Y [2012-05-01, 2013-03-01)
# Join the first continuous range df back onto the original df and see if
# the current `range` falls within the first continuous range or not.
# This gives you `final_df1`.
left_join(df, first_continuous, by = "id") %>%
mutate(continuous = iv_pairwise_overlaps(range, range_continuous, type = "within"))
#> # A tibble: 30 × 4
#> id range range_continuous continuous
#> <chr> <iv<date>> <iv<date>> <lgl>
#> 1 X [2012-02-01, 2012-03-01) [2012-02-01, 2012-07-01) TRUE
#> 2 X [2012-03-01, 2012-04-01) [2012-02-01, 2012-07-01) TRUE
#> 3 X [2012-04-01, 2012-05-01) [2012-02-01, 2012-07-01) TRUE
#> 4 X [2012-05-01, 2012-06-01) [2012-02-01, 2012-07-01) TRUE
#> 5 X [2012-06-01, 2012-07-01) [2012-02-01, 2012-07-01) TRUE
#> 6 X [2012-08-01, 2012-09-01) [2012-02-01, 2012-07-01) FALSE
#> 7 X [2012-09-01, 2012-10-01) [2012-02-01, 2012-07-01) FALSE
#> 8 X [2012-10-01, 2012-11-01) [2012-02-01, 2012-07-01) FALSE
#> 9 X [2012-11-01, 2012-12-01) [2012-02-01, 2012-07-01) FALSE
#> 10 X [2012-12-01, 2013-01-01) [2012-02-01, 2012-07-01) FALSE
#> # … with 20 more rows
由 reprex package (v2.0.1)
创建于 2022-05-13
我有几个大型注册数据集,我正在尝试创建两件事:
- 我想标记每个不间断的每月观察 (
final_df1
) - 我想创建一个不间断跨度的数据集 (
final_df2
)
例如:
library(tidyverse)
library(lubridate)
library(magrittr)
df<-tibble(id=c(rep("X",10),rep("Y",20)),
date=c(ymd("20120101")%m+%months(c(1:5,7:11)),ymd("20120401")%m+%months(c(1:10,12:17,19:22))))
final_df1 <- df %>% mutate(cont_enroll=c(rep(1,5),rep(0,5),rep(1,10),rep(0,10)))
final_df2 <- tibble(id=c(rep("X",2),rep("Y",3)),
span_start=c(ymd("20120101")%m+%months(1),
ymd("20120101")%m+%months(7),
ymd("20120401")%m+%months(1),
ymd("20120101")%m+%months(12),
ymd("20120101")%m+%months(19)),
span_end=c(ymd("20120101")%m+%months(5),
ymd("20120101")%m+%months(11),
ymd("20120101")%m+%months(10),
ymd("20120101")%m+%months(17),
ymd("20120101")%m+%months(22))
)
我觉得在 {lubridate} 和 {data.table} 之间一定有一种简单的方法可以做到这一点,但我正在画空白。有什么建议吗?
按 'id' 分组,用之前的值 'date' (lag
) 和当前值 'date' 创建一个 interval
,除以 months
,检查是否小于2,取累计最小值(cummin
)。创建 'find_df_new' 后,我们按 'id' 和 'cont_enroll' 列的 run-length-id 分组,summarise
与 first
和 last
'date' 的值分别创建 'span_start' 和 'span_end'
library(dplyr)
library(lubridate)
library(data.table)
final_df_new <- df %>%
group_by(id) %>%
mutate(cont_enroll2 = cummin(interval(lag(date, default = first(date)),
date) /months(1) < 2)) %>%
ungroup
final_df_new %>%
group_by(id, grp = rleid(cont_enroll2)) %>%
summarise(span_start = first(date), span_end = last(date), .groups = 'drop')
我认为您可以使用 ivs 包很好地解决这个问题。您的日期似乎确实代表了 1 个月的间隔,而 ivs 包专门用于处理此类数据。
我们可以用 iv_groups()
计算 final_df2
,其中 returns 合并所有重叠区间后剩余的 non-overlapping 个区间。
那么每组的第一行final_df2
代表第一个连续的区间,所以只需要检查每个区间是否在这个区间内,判断是否属于不间断集合即可得到final_df1
.
请注意,我的 final_df2
看起来与您的不同,您的编码方式是否有误?
library(dplyr)
library(lubridate)
library(ivs)
df <- tibble(
id = c(
rep("X", 10),
rep("Y", 20)
),
date = c(
ymd("20120101") %m+% months(c(1:5,7:11)),
ymd("20120401") %m+% months(c(1:10,12:17,19:22))
)
)
df
#> # A tibble: 30 × 2
#> id date
#> <chr> <date>
#> 1 X 2012-02-01
#> 2 X 2012-03-01
#> 3 X 2012-04-01
#> 4 X 2012-05-01
#> 5 X 2012-06-01
#> 6 X 2012-08-01
#> 7 X 2012-09-01
#> 8 X 2012-10-01
#> 9 X 2012-11-01
#> 10 X 2012-12-01
#> # … with 20 more rows
df <- df %>%
mutate(start = date, end = date + months(1), .keep = "unused") %>%
mutate(range = iv(start, end), .keep = "unused")
df
#> # A tibble: 30 × 2
#> id range
#> <chr> <iv<date>>
#> 1 X [2012-02-01, 2012-03-01)
#> 2 X [2012-03-01, 2012-04-01)
#> 3 X [2012-04-01, 2012-05-01)
#> 4 X [2012-05-01, 2012-06-01)
#> 5 X [2012-06-01, 2012-07-01)
#> 6 X [2012-08-01, 2012-09-01)
#> 7 X [2012-09-01, 2012-10-01)
#> 8 X [2012-10-01, 2012-11-01)
#> 9 X [2012-11-01, 2012-12-01)
#> 10 X [2012-12-01, 2013-01-01)
#> # … with 20 more rows
# `iv_groups()` returns the groups that remain after merging all overlapping ranges.
# It gives you `final_df2`.
continuous <- df %>%
group_by(id) %>%
summarise(range = iv_groups(range), .groups = "drop")
continuous
#> # A tibble: 5 × 2
#> id range
#> <chr> <iv<date>>
#> 1 X [2012-02-01, 2012-07-01)
#> 2 X [2012-08-01, 2013-01-01)
#> 3 Y [2012-05-01, 2013-03-01)
#> 4 Y [2013-04-01, 2013-10-01)
#> 5 Y [2013-11-01, 2014-03-01)
# The first continuous range per id
first_continuous <- continuous %>%
group_by(id) %>%
slice(1) %>%
ungroup() %>%
rename(range_continuous = range)
first_continuous
#> # A tibble: 2 × 2
#> id range_continuous
#> <chr> <iv<date>>
#> 1 X [2012-02-01, 2012-07-01)
#> 2 Y [2012-05-01, 2013-03-01)
# Join the first continuous range df back onto the original df and see if
# the current `range` falls within the first continuous range or not.
# This gives you `final_df1`.
left_join(df, first_continuous, by = "id") %>%
mutate(continuous = iv_pairwise_overlaps(range, range_continuous, type = "within"))
#> # A tibble: 30 × 4
#> id range range_continuous continuous
#> <chr> <iv<date>> <iv<date>> <lgl>
#> 1 X [2012-02-01, 2012-03-01) [2012-02-01, 2012-07-01) TRUE
#> 2 X [2012-03-01, 2012-04-01) [2012-02-01, 2012-07-01) TRUE
#> 3 X [2012-04-01, 2012-05-01) [2012-02-01, 2012-07-01) TRUE
#> 4 X [2012-05-01, 2012-06-01) [2012-02-01, 2012-07-01) TRUE
#> 5 X [2012-06-01, 2012-07-01) [2012-02-01, 2012-07-01) TRUE
#> 6 X [2012-08-01, 2012-09-01) [2012-02-01, 2012-07-01) FALSE
#> 7 X [2012-09-01, 2012-10-01) [2012-02-01, 2012-07-01) FALSE
#> 8 X [2012-10-01, 2012-11-01) [2012-02-01, 2012-07-01) FALSE
#> 9 X [2012-11-01, 2012-12-01) [2012-02-01, 2012-07-01) FALSE
#> 10 X [2012-12-01, 2013-01-01) [2012-02-01, 2012-07-01) FALSE
#> # … with 20 more rows
由 reprex package (v2.0.1)
创建于 2022-05-13