Count observations over rolling 30 day window
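I need to create a variable that counts, for each ID, the number of observations that occurred in the last 30 days.
For example, suppose an observation for ID "a" occurs on 1/2/2021 (d/m/y, i.e. 1 February 2021). If it is the first observation for ID "a" between 1/1/2021 and 1/2/2021, the variable must be 1; if it is the second, 2; and so on.
Here is a larger example: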
dat <- tibble::tribble(
~id, ~q, ~date,
"a", 1, "01/01/2021",
"a", 1, "01/01/2021",
"a", 1, "21/01/2021",
"a", 1, "21/01/2021",
"a", 1, "12/02/2021",
"a", 1, "12/02/2021",
"a", 1, "12/02/2021",
"a", 1, "12/02/2021",
"b", 1, "02/02/2021",
"b", 1, "02/02/2021",
"b", 1, "22/02/2021",
"b", 1, "22/02/2021",
"b", 1, "13/03/2021",
"b", 1, "13/03/2021",
"b", 1, "13/03/2021",
"b", 1, "13/03/2021")
dat$date <- lubridate::dmy(dat$date)
The result should be:
id q date newvar
a 1 01/01/2021 1
a 1 01/01/2021 2
a 1 21/01/2021 3
a 1 21/01/2021 4
a 1 12/02/2021 3
a 1 12/02/2021 4
a 1 12/02/2021 5
a 1 12/02/2021 6
b 1 02/02/2021 1
b 1 02/02/2021 2
b 1 22/02/2021 3
b 1 22/02/2021 4
b 1 13/03/2021 3
b 1 13/03/2021 4
b 1 13/03/2021 5
b 1 13/03/2021 6
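Thank you very much.
Perform a left self join on the indicated conditions, grouping by the rows of the left-hand data frame. We have assumed that you want a 30 day window ending at the current row; if instead you want to go back 30 days (a 31 day window), change 29 to 30. Both give the same result for this data.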
library(sqldf)
sqldf("select a.*, count(b.date) as newvar
from dat a left join dat b
on a.id = b.id and b.date between a.date - 29 and a.date and b.rowid <= a.rowid
group by a.rowid")
giving:
id q date newvar
1 a 1 2021-01-01 1
2 a 1 2021-01-01 2
3 a 1 2021-01-21 3
4 a 1 2021-01-21 4
5 a 1 2021-02-12 3
6 a 1 2021-02-12 4
7 a 1 2021-02-12 5
8 a 1 2021-02-12 6
9 b 1 2021-02-02 1
10 b 1 2021-02-02 2
11 b 1 2021-02-22 3
12 b 1 2021-02-22 4
13 b 1 2021-03-13 3
14 b 1 2021-03-13 4
15 b 1 2021-03-13 5
16 b 1 2021-03-13 6
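To see where the choice between 29 and 30 actually changes the count, here is a minimal base R sketch. It does not use sqldf; `count_in_window` is a hypothetical helper written only for this illustration, and the two dates are made up so that they sit exactly 30 days apart.

```r
# Hypothetical helper (not part of sqldf): for each position i, count the rows
# at or before i whose date lies in the closed interval [d[i] - back, d[i]].
count_in_window <- function(d, back) {
  d <- as.numeric(d)  # Dates become days since 1970-01-01
  vapply(seq_along(d), function(i) {
    earlier <- d[seq_len(i)]
    sum(earlier >= d[i] - back & earlier <= d[i])
  }, integer(1))
}

# Illustrative data: two observations exactly 30 days apart.
d <- as.Date(c("2021-01-01", "2021-01-31"))

count_in_window(d, back = 29)  # 30 day window: second row counts only itself
count_in_window(d, back = 30)  # 31 day window: second row also counts the first
```

In the question's data no pair of dates for the same ID is exactly 30 days apart, which is why both settings agree there.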
Writing it in a pipeline also works, using [.] to denote the input data frame.
dat %>% {
sqldf("select a.*, count(b.date) as newvar
from [.] a left join [.] b
on a.id = b.id and b.date between a.date - 29 and a.date and b.rowid <= a.rowid
group by a.rowid")
}
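On the question's data this runs faster than the sapply approach, about 1.6 times by median in the benchmark below (roughly 27 ms versus 44 ms).
library(microbenchmark)
library(sqldf)
library(dplyr)     # group_by, mutate, between, %>%
library(lubridate) # days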
microbenchmark(
sqldf = sqldf("select a.*, count(b.date) as newvar
from dat a left join dat b
on a.id = b.id and b.date between a.date - 29 and a.date and b.rowid <= a.rowid
group by a.rowid"),
sapply = dat %>%
group_by(id) %>%
mutate(newvar = sapply(seq(length(date)),
function(x) sum(between(date[1:x], date[x] - days(30), date[x]))))
)
giving:
Unit: milliseconds
expr min lq mean median uq max neval cld
sqldf 26.2768 26.77340 27.97039 27.0082 27.29515 63.1032 100 a
sapply 42.8800 43.69345 48.53094 44.1089 45.25275 285.4861 100 b
With sapply and between, count the observations up to and including the current one that fall within the 30 days before it.
library(lubridate)
library(dplyr)
dat %>%
group_by(id) %>%
mutate(newvar = sapply(seq(length(date)),
function(x) sum(between(date[1:x], date[x] - days(30), date[x]))))
# A tibble: 16 x 4
# Groups: id [2]
id q date newvar
<chr> <dbl> <date> <int>
1 a 1 2021-01-01 1
2 a 1 2021-01-01 2
3 a 1 2021-01-21 3
4 a 1 2021-01-21 4
5 a 1 2021-02-12 3
6 a 1 2021-02-12 4
7 a 1 2021-02-12 5
8 a 1 2021-02-12 6
9 b 1 2021-02-02 1
10 b 1 2021-02-02 2
11 b 1 2021-02-22 3
12 b 1 2021-02-22 4
13 b 1 2021-03-13 3
14 b 1 2021-03-13 4
15 b 1 2021-03-13 5
16 b 1 2021-03-13 6
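For readers who would rather avoid extra packages, the same counting idea can be sketched in base R. This is an illustration under assumptions, not either answerer's code: `rolling_count` is a hypothetical helper, the data frame is rebuilt with `data.frame` instead of `tibble::tribble`, the `q` column is omitted, and the window reaches back 30 days as in the sapply version.

```r
# Rebuild the question's id and date columns in base R.
dat <- data.frame(
  id = rep(c("a", "b"), each = 8),
  date = as.Date(c(
    rep(c("2021-01-01", "2021-01-21"), each = 2), rep("2021-02-12", 4),
    rep(c("2021-02-02", "2021-02-22"), each = 2), rep("2021-03-13", 4)
  ))
)

# Hypothetical helper: for each row i, count rows 1..i whose date falls in
# [d[i] - back, d[i]]. Works on dates stored as numeric day counts.
rolling_count <- function(d, back = 30) {
  vapply(seq_along(d), function(i) {
    earlier <- d[seq_len(i)]
    sum(earlier >= d[i] - back & earlier <= d[i])
  }, integer(1))
}

# ave() applies the helper within each id and returns the results in row order.
dat$newvar <- ave(as.numeric(dat$date), dat$id, FUN = rolling_count)
dat
```

Like the sapply version, this only looks at rows at or above the current row within each ID, so it relies on the row order of the input.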