Rolling sums across date range on data table by group in R
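I have a data.table containing events and sub-events over time, and I would like to create two columns: (1) a cumulative rolling sum of the events that occur within 5 years of each event's date, and (2) a count of the sub-events (including events) that occur within 5 years of the event date. Here is a code example: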
library(data.table)
dt = data.table(id=c(rep(52749, 14), rep(46760, 15)),
date=c("2007-01-30","2007-03-15","2007-11-27",
"2007-11-29","2008-10-09","2009-04-02",
"2011-01-06","2011-07-26","2012-01-25",
"2015-01-12","2016-09-13","2017-03-21",
"2017-08-29","2017-10-10","2008-01-01",
"2010-07-19","2011-01-14","2011-08-02",
"2011-08-02","2012-02-01","2012-02-01",
"2015-04-28","2015-10-19","2016-05-16",
"2016-12-22","2016-12-23","2017-05-16",
"2017-11-15","2018-02-22"),
idx=c(seq_len(14), seq_len(15)),
count=c(rep(14,14),rep(15,15)),
event=c(1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1,
1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0))
This produces the following:
id date idx count event
52749 2007-01-30 1 14 1
52749 2007-03-15 2 14 0
52749 2007-11-27 3 14 1
52749 2007-11-29 4 14 0
52749 2008-10-09 5 14 1
52749 2009-04-02 6 14 0
52749 2011-01-06 7 14 1
52749 2011-07-26 8 14 1
52749 2012-01-25 9 14 0
52749 2015-01-12 10 14 1
52749 2016-09-13 11 14 1
52749 2017-03-21 12 14 1
52749 2017-08-29 13 14 0
52749 2017-10-10 14 14 0
46760 2008-01-01 1 15 1
46760 2010-07-19 2 15 1
46760 2011-01-14 3 15 0
46760 2011-08-02 4 15 1
46760 2011-08-02 5 15 0
46760 2012-02-01 6 15 1
46760 2012-02-01 7 15 0
46760 2015-04-28 8 15 1
46760 2015-10-19 9 15 0
46760 2016-05-16 10 15 1
46760 2016-12-22 11 15 1
46760 2016-12-23 12 15 0
46760 2017-05-16 13 15 0
46760 2017-11-15 14 15 1
46760 2018-02-22 15 15 0
What I essentially need is:
id date idx count event num_event_5yr_fu num_subevents
52749 2007-01-30 1 14 1 4 8
52749 2007-03-15 2 14 0 NA NA
52749 2007-11-27 3 14 1 3 6
52749 2007-11-29 4 14 0 NA NA
52749 2008-10-09 5 14 1 2 4
52749 2009-04-02 6 14 0 NA NA
52749 2011-01-06 7 14 1 2 3
52749 2011-07-26 8 14 1 1 2
52749 2012-01-25 9 14 0 NA NA
52749 2015-01-12 10 14 1 2 4
52749 2016-09-13 11 14 1 1 3
52749 2017-03-21 12 14 1 0 2
52749 2017-08-29 13 14 0 NA NA
52749 2017-10-10 14 14 0 NA NA
46760 2008-01-01 1 15 1 3 6
46760 2010-07-19 2 15 1 3 6
46760 2011-01-14 3 15 0 NA NA
46760 2011-08-02 4 15 1 3 6
46760 2011-08-02 5 15 0 NA NA
46760 2012-02-01 6 15 1 3 6
46760 2012-02-01 7 15 0 NA NA
46760 2015-04-28 8 15 1 3 7
46760 2015-10-19 9 15 0 NA NA
46760 2016-05-16 10 15 1 2 5
46760 2016-12-22 11 15 1 1 4
46760 2016-12-23 12 15 0 NA NA
46760 2017-05-16 13 15 0 NA NA
46760 2017-11-15 14 15 1 0 1
46760 2018-02-22 15 15 0 NA NA
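where num_event_5yr_fu counts the number of times an event occurs (i.e. the cumulative sum of events) within 5 years of the event date, not including the event date, and num_subevents counts the number of records within 5 years of the event date, not including the event date.
I have been working on this for quite a while and am stuck. I would really appreciate any input on how to achieve this. Thanks.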
Here is a data.table approach using a non-equi join:
library(data.table)
library(lubridate)
dt[, date := as.Date(date)]
dt[, end_date := date]
year(dt$end_date) <- year(dt$end_date) + 5
dt[, rowid := .I]
event_count = dt[dt, on = .(date < date , end_date >= date, id),
allow.cartesian=TRUE][!is.na(rowid) & event == 1,
.(events = sum(i.event), num_subevents = .N),
by = .(rowid, id)]
dt[event_count, on = .(rowid, id), `:=`(num_event_5yr_fu = i.events,
num_subevents = i.num_subevents)]
dt[, c("end_date", "rowid") := NULL]
dt
# id date idx count event num_event_5yr_fu num_subevents
# 1: 52749 2007-01-30 1 14 1 4 8
# 2: 52749 2007-03-15 2 14 0 NA NA
# 3: 52749 2007-11-27 3 14 1 3 6
# 4: 52749 2007-11-29 4 14 0 NA NA
# 5: 52749 2008-10-09 5 14 1 2 4
# 6: 52749 2009-04-02 6 14 0 NA NA
# 7: 52749 2011-01-06 7 14 1 2 3
# 8: 52749 2011-07-26 8 14 1 1 2
# 9: 52749 2012-01-25 9 14 0 NA NA
# 10: 52749 2015-01-12 10 14 1 2 4
# 11: 52749 2016-09-13 11 14 1 1 3
# 12: 52749 2017-03-21 12 14 1 0 2
# 13: 52749 2017-08-29 13 14 0 NA NA
# 14: 52749 2017-10-10 14 14 0 NA NA
# 15: 46760 2008-01-01 1 15 1 3 6
# 16: 46760 2010-07-19 2 15 1 3 6
# 17: 46760 2011-01-14 3 15 0 NA NA
# 18: 46760 2011-08-02 4 15 1 3 5
# 19: 46760 2011-08-02 5 15 0 NA NA
# 20: 46760 2012-02-01 6 15 1 3 5
# 21: 46760 2012-02-01 7 15 0 NA NA
# 22: 46760 2015-04-28 8 15 1 3 7
# 23: 46760 2015-10-19 9 15 0 NA NA
# 24: 46760 2016-05-16 10 15 1 2 5
# 25: 46760 2016-12-22 11 15 1 1 4
# 26: 46760 2016-12-23 12 15 0 NA NA
# 27: 46760 2017-05-16 13 15 0 NA NA
# 28: 46760 2017-11-15 14 15 1 0 1
# 29: 46760 2018-02-22 15 15 0 NA NA
Another option:
library(data.table)
library(lubridate)
dt[, date := as.Date(date)][
, num_event_5yr_fu := sapply(date,
function(x) sum(event[between(date, x + 1, x + years(5))])), by = id
][, num_subevents := sapply(date,
function(x) length(event[between(date, x + 1, x + years(5))])), by = id
][event == 0, `:=` (num_event_5yr_fu = NA, num_subevents = NA)]
Output:
id date idx count event num_event_5yr_fu num_subevents
1: 52749 2007-01-30 1 14 1 4 8
2: 52749 2007-03-15 2 14 0 NA NA
3: 52749 2007-11-27 3 14 1 3 6
4: 52749 2007-11-29 4 14 0 NA NA
5: 52749 2008-10-09 5 14 1 2 4
6: 52749 2009-04-02 6 14 0 NA NA
7: 52749 2011-01-06 7 14 1 2 3
8: 52749 2011-07-26 8 14 1 1 2
9: 52749 2012-01-25 9 14 0 NA NA
10: 52749 2015-01-12 10 14 1 2 4
11: 52749 2016-09-13 11 14 1 1 3
12: 52749 2017-03-21 12 14 1 0 2
13: 52749 2017-08-29 13 14 0 NA NA
14: 52749 2017-10-10 14 14 0 NA NA
15: 46760 2008-01-01 1 15 1 3 6
16: 46760 2010-07-19 2 15 1 3 6
17: 46760 2011-01-14 3 15 0 NA NA
18: 46760 2011-08-02 4 15 1 3 5
19: 46760 2011-08-02 5 15 0 NA NA
20: 46760 2012-02-01 6 15 1 3 5
21: 46760 2012-02-01 7 15 0 NA NA
22: 46760 2015-04-28 8 15 1 3 7
23: 46760 2015-10-19 9 15 0 NA NA
24: 46760 2016-05-16 10 15 1 2 5
25: 46760 2016-12-22 11 15 1 1 4
26: 46760 2016-12-23 12 15 0 NA NA
27: 46760 2017-05-16 13 15 0 NA NA
28: 46760 2017-11-15 14 15 1 0 1
29: 46760 2018-02-22 15 15 0 NA NA
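Both answers above use lubridate to find the end of the five-year window. As an aside, a dependency-free sketch of the same calendar arithmetic in base R might look like the following. `add_years` is a hypothetical helper name, not something from either answer, and its behaviour for a 29 February start date is not examined here.

```r
# Hypothetical helper (an assumption, not from the answers):
# shift a Date vector by n calendar years using base R only.
add_years <- function(d, n) {
  lt <- as.POSIXlt(d)      # broken-down time; $year is years since 1900
  lt$year <- lt$year + n   # move the calendar year forward
  as.Date(lt)              # convert back to Date
}

window_end <- add_years(as.Date(c("2007-01-30", "2011-08-02")), 5)
window_end
```

The two end dates correspond to the windows used for the events on 2007-01-30 and 2011-08-02 in the question's data.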
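OP's specification deviates from OP's expected result.
OP has specified that num_event_5yr_fu counts the number of times an event occurs (or the cumulative sum) within 5 years of the event date, not including the event date, and that num_subevents counts the number of records within 5 years of the event date, not including the event date.
However, in OP's expected result, num_subevents counts the number of records within 5 years of the event date, not including the event row (= record?).
Therefore, two solutions are provided which cover both interpretations.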
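To make the discrepancy concrete, here is a small base R sketch that counts the records for the event on 2011-08-02 (id 46760) under both interpretations. The variable names are made up for illustration, and the end of the five-year window is written out by hand rather than computed.

```r
# Dates for id 46760, copied from the question's data
dates <- as.Date(c("2008-01-01", "2010-07-19", "2011-01-14", "2011-08-02",
                   "2011-08-02", "2012-02-01", "2012-02-01", "2015-04-28",
                   "2015-10-19", "2016-05-16", "2016-12-22", "2016-12-23",
                   "2017-05-16", "2017-11-15", "2018-02-22"))
start <- as.Date("2011-08-02")  # event date of the row with idx == 4
end   <- as.Date("2016-08-02")  # five years later, written out by hand

# Interpretation 1 (as specified): drop every record dated on the event date
n_excl_date <- sum(dates > start & dates <= end)

# Interpretation 2 (as in the expected result): drop only the event row itself
n_excl_row <- sum(dates >= start & dates <= end) - 1L

c(n_excl_date, n_excl_row)
```

The two counts differ by one because a second record shares the event's date. The same situation arises for the event on 2012-02-01.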
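Reproducing OP's expected result
This approach reproduces OP's expected result (in contrast to the two other answers, which implement OP's requirements as described).
It aggregates and updates in a single non-equi join. The event date is included in the join range, and the aggregates are then corrected by subtracting one so that the event row itself is not counted.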
library(data.table)
new_cols <- c("num_event_5yr_fu", "num_subevents")
result <- dt[
, date := as.Date(date)][
.(id = id, start = date, end = date + lubridate::years(5)),
on = .(id, date >= start, date <= end),
(new_cols) := .(sum(event) - 1, .N - 1L), by = .EACHI][
event == 0, (new_cols) := NA][]
result
id date idx count event num_event_5yr_fu num_subevents
1: 52749 2007-01-30 1 14 1 4 8
2: 52749 2007-03-15 2 14 0 NA NA
3: 52749 2007-11-27 3 14 1 3 6
4: 52749 2007-11-29 4 14 0 NA NA
5: 52749 2008-10-09 5 14 1 2 4
6: 52749 2009-04-02 6 14 0 NA NA
7: 52749 2011-01-06 7 14 1 2 3
8: 52749 2011-07-26 8 14 1 1 2
9: 52749 2012-01-25 9 14 0 NA NA
10: 52749 2015-01-12 10 14 1 2 4
11: 52749 2016-09-13 11 14 1 1 3
12: 52749 2017-03-21 12 14 1 0 2
13: 52749 2017-08-29 13 14 0 NA NA
14: 52749 2017-10-10 14 14 0 NA NA
15: 46760 2008-01-01 1 15 1 3 6
16: 46760 2010-07-19 2 15 1 3 6
17: 46760 2011-01-14 3 15 0 NA NA
18: 46760 2011-08-02 4 15 1 3 6
19: 46760 2011-08-02 5 15 0 NA NA
20: 46760 2012-02-01 6 15 1 3 6
21: 46760 2012-02-01 7 15 0 NA NA
22: 46760 2015-04-28 8 15 1 3 7
23: 46760 2015-10-19 9 15 0 NA NA
24: 46760 2016-05-16 10 15 1 2 5
25: 46760 2016-12-22 11 15 1 1 4
26: 46760 2016-12-23 12 15 0 NA NA
27: 46760 2017-05-16 13 15 0 NA NA
28: 46760 2017-11-15 14 15 1 0 1
29: 46760 2018-02-22 15 15 0 NA NA
id date idx count event num_event_5yr_fu num_subevents
Note that rows 18 to 20 (id == 46760 and date between 2011-08-02 and 2012-02-01) match OP's expected result.
This can be verified by
all.equal(result, expected, check.attributes = FALSE)
[1] TRUE
Reproducing the other answers
Here, only records with a date strictly later than the event date are counted.
library(data.table)
tmp <- dt[, date := as.Date(date)][
dt[event == 1, .(id, start = date, end = date + lubridate::years(5))],
on = .(id, date > start, date <= end),
.(event = 1, sum(event), .N), by = .EACHI]
result <- dt[tmp, on = .(id, event, date),
c("num_event_5yr_fu", "num_subevents") := .(V2, N)][]
result
id date idx count event num_event_5yr_fu num_subevents
1: 52749 2007-01-30 1 14 1 4 8
2: 52749 2007-03-15 2 14 0 NA NA
3: 52749 2007-11-27 3 14 1 3 6
4: 52749 2007-11-29 4 14 0 NA NA
5: 52749 2008-10-09 5 14 1 2 4
6: 52749 2009-04-02 6 14 0 NA NA
7: 52749 2011-01-06 7 14 1 2 3
8: 52749 2011-07-26 8 14 1 1 2
9: 52749 2012-01-25 9 14 0 NA NA
10: 52749 2015-01-12 10 14 1 2 4
11: 52749 2016-09-13 11 14 1 1 3
12: 52749 2017-03-21 12 14 1 0 2
13: 52749 2017-08-29 13 14 0 NA NA
14: 52749 2017-10-10 14 14 0 NA NA
15: 46760 2008-01-01 1 15 1 3 6
16: 46760 2010-07-19 2 15 1 3 6
17: 46760 2011-01-14 3 15 0 NA NA
18: 46760 2011-08-02 4 15 1 3 5
19: 46760 2011-08-02 5 15 0 NA NA
20: 46760 2012-02-01 6 15 1 3 5
21: 46760 2012-02-01 7 15 0 NA NA
22: 46760 2015-04-28 8 15 1 3 7
23: 46760 2015-10-19 9 15 0 NA NA
24: 46760 2016-05-16 10 15 1 2 5
25: 46760 2016-12-22 11 15 1 1 4
26: 46760 2016-12-23 12 15 0 NA NA
27: 46760 2017-05-16 13 15 0 NA NA
28: 46760 2017-11-15 14 15 1 0 1
29: 46760 2018-02-22 15 15 0 NA NA
id date idx count event num_event_5yr_fu num_subevents
The intermediate result is
tmp
id date date event V2 N
1: 52749 2007-01-30 2012-01-30 1 4 8
2: 52749 2007-11-27 2012-11-27 1 3 6
3: 52749 2008-10-09 2013-10-09 1 2 4
4: 52749 2011-01-06 2016-01-06 1 2 3
5: 52749 2011-07-26 2016-07-26 1 1 2
6: 52749 2015-01-12 2020-01-12 1 2 4
7: 52749 2016-09-13 2021-09-13 1 1 3
8: 52749 2017-03-21 2022-03-21 1 0 2
9: 46760 2008-01-01 2013-01-01 1 3 6
10: 46760 2010-07-19 2015-07-19 1 3 6
11: 46760 2011-08-02 2016-08-02 1 3 5
12: 46760 2012-02-01 2017-02-01 1 3 5
13: 46760 2015-04-28 2020-04-28 1 3 7
14: 46760 2016-05-16 2021-05-16 1 2 5
15: 46760 2016-12-22 2021-12-22 1 1 4
16: 46760 2017-11-15 2022-11-15 1 0 1
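It contains results only for event == 1. In the final update join, event is included among the join keys. Rows with event == 0 have no match in tmp, so the new columns are automatically left as NA for them.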
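The NA behaviour of an update join can be seen on a toy example. This is only a sketch: the tables `x` and `agg` and the column `n_matched` are hypothetical names, not part of the answer.

```r
library(data.table)

# Toy tables (hypothetical): x plays the role of dt, agg the role of tmp
x   <- data.table(id = c(1, 1, 2), event = c(1, 0, 1))
agg <- data.table(id = c(1, 2), event = c(1, 1), n = c(10L, 20L))

# Update join: only rows of x matching a row of agg on (id, event) are assigned;
# the unmatched row (id == 1, event == 0) keeps NA in the new column.
x[agg, on = .(id, event), n_matched := i.n]
x
```

Because event is part of the join key, no separate step is needed to blank out the rows with event == 0.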
Data
dt = data.table(id=c(rep(52749, 14), rep(46760, 15)),
date=c("2007-01-30","2007-03-15","2007-11-27",
"2007-11-29","2008-10-09","2009-04-02",
"2011-01-06","2011-07-26","2012-01-25",
"2015-01-12","2016-09-13","2017-03-21",
"2017-08-29","2017-10-10","2008-01-01",
"2010-07-19","2011-01-14","2011-08-02",
"2011-08-02","2012-02-01","2012-02-01",
"2015-04-28","2015-10-19","2016-05-16",
"2016-12-22","2016-12-23","2017-05-16",
"2017-11-15","2018-02-22"),
idx=c(seq_len(14), seq_len(15)),
count=c(rep(14,14),rep(15,15)),
event=c(1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1,
1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0))
expected <-
fread("id date idx count event num_event_5yr_fu num_subevents
52749 2007-01-30 1 14 1 4 8
52749 2007-03-15 2 14 0 NA NA
52749 2007-11-27 3 14 1 3 6
52749 2007-11-29 4 14 0 NA NA
52749 2008-10-09 5 14 1 2 4
52749 2009-04-02 6 14 0 NA NA
52749 2011-01-06 7 14 1 2 3
52749 2011-07-26 8 14 1 1 2
52749 2012-01-25 9 14 0 NA NA
52749 2015-01-12 10 14 1 2 4
52749 2016-09-13 11 14 1 1 3
52749 2017-03-21 12 14 1 0 2
52749 2017-08-29 13 14 0 NA NA
52749 2017-10-10 14 14 0 NA NA
46760 2008-01-01 1 15 1 3 6
46760 2010-07-19 2 15 1 3 6
46760 2011-01-14 3 15 0 NA NA
46760 2011-08-02 4 15 1 3 6
46760 2011-08-02 5 15 0 NA NA
46760 2012-02-01 6 15 1 3 6
46760 2012-02-01 7 15 0 NA NA
46760 2015-04-28 8 15 1 3 7
46760 2015-10-19 9 15 0 NA NA
46760 2016-05-16 10 15 1 2 5
46760 2016-12-22 11 15 1 1 4
46760 2016-12-23 12 15 0 NA NA
46760 2017-05-16 13 15 0 NA NA
46760 2017-11-15 14 15 1 0 1
46760 2018-02-22 15 15 0 NA NA")[
, date := as.Date(date)]