Count observations over rolling 30 day window

I need to create a variable that counts, for each id, the number of observations that occurred in the last 30 days.

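For example, imagine an observation for id "a" that occurs on 01/02/2021 (dd/mm/yyyy, i.e. 1 February 2021). If this observation is the first for id "a" between 01/01/2021 and 01/02/2021, the variable must give 1. If it is the second, it must give 2, and so on.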

Here is a larger example:

dat <- tibble::tribble(
  ~id,  ~q,   ~date,
  "a",   1,   "01/01/2021",
  "a",   1,   "01/01/2021",
  "a",   1,   "21/01/2021",
  "a",   1,   "21/01/2021",
  "a",   1,   "12/02/2021",
  "a",   1,   "12/02/2021",
  "a",   1,   "12/02/2021",
  "a",   1,   "12/02/2021",
  "b",   1,   "02/02/2021",
  "b",   1,   "02/02/2021",
  "b",   1,   "22/02/2021",
  "b",   1,   "22/02/2021",
  "b",   1,   "13/03/2021",
  "b",   1,   "13/03/2021",
  "b",   1,   "13/03/2021",
  "b",   1,   "13/03/2021")
dat$date <- lubridate::dmy(dat$date)

The result should be:

id  q   date    newvar
a   1   01/01/2021  1
a   1   01/01/2021  2
a   1   21/01/2021  3
a   1   21/01/2021  4
a   1   12/02/2021  3
a   1   12/02/2021  4
a   1   12/02/2021  5
a   1   12/02/2021  6
b   1   02/02/2021  1
b   1   02/02/2021  2
b   1   22/02/2021  3
b   1   22/02/2021  4
b   1   13/03/2021  3
b   1   13/03/2021  4
b   1   13/03/2021  5
b   1   13/03/2021  6

Thank you very much.

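Left join the data to itself on the indicated conditions, grouping by the rows of the left-hand data frame. We have assumed that you want a 30 day window ending at the current row, but if you want to go back 30 days (a 31 day window), change 29 to 30. For this data both give the same result.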

library(sqldf)

sqldf("select a.*, count(b.date) as newvar
  from dat a left join dat b
  on a.id = b.id and b.date between a.date - 29 and a.date and b.rowid <= a.rowid
  group by a.rowid")

giving:

   id q       date        newvar
1   a 1 2021-01-01             1
2   a 1 2021-01-01             2
3   a 1 2021-01-21             3
4   a 1 2021-01-21             4
5   a 1 2021-02-12             3
6   a 1 2021-02-12             4
7   a 1 2021-02-12             5
8   a 1 2021-02-12             6
9   b 1 2021-02-02             1
10  b 1 2021-02-02             2
11  b 1 2021-02-22             3
12  b 1 2021-02-22             4
13  b 1 2021-03-13             3
14  b 1 2021-03-13             4
15  b 1 2021-03-13             5
16  b 1 2021-03-13             6
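
The join conditions above can also be read as a plain loop over rows. The following is a minimal base R sketch of that logic, not part of sqldf: count_window and its lookback argument are hypothetical names used only for illustration, and dat2 rebuilds the question's data with as.Date so that the block needs no extra packages.

```r
# Hypothetical helper (illustration only): for each row i, count the rows
# at or before i that share its id and fall in [date[i] - lookback, date[i]].
count_window <- function(id, date, lookback = 29) {
  vapply(seq_along(date), function(i) {
    j <- seq_len(i)                        # mirrors b.rowid <= a.rowid
    sum(id[j] == id[i] &                   # mirrors a.id = b.id
          date[j] >= date[i] - lookback &  # lower bound of the window
          date[j] <= date[i])              # upper bound: the current row's date
  }, integer(1))
}

# The question's data, rebuilt in base R (dd/mm/yyyy strings parsed with as.Date)
dates <- c("01/01/2021", "21/01/2021", "12/02/2021",
           "02/02/2021", "22/02/2021", "13/03/2021")
dat2 <- data.frame(
  id   = rep(c("a", "b"), each = 8),
  date = as.Date(rep(dates, times = c(2, 2, 4, 2, 2, 4)), format = "%d/%m/%Y")
)
dat2$newvar <- count_window(dat2$id, dat2$date)
dat2
```

Like the join, this compares every row with all earlier rows, so the work grows quadratically with the number of rows. That is fine for data of this size but may become slow on much larger inputs.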

Writing it as a pipeline, using [.] to denote the input data frame, also works.

library(magrittr)  # provides %>%

dat %>% { 
  sqldf("select a.*, count(b.date) as newvar
    from [.] a left join [.] b
      on a.id = b.id and b.date between a.date - 29 and a.date and b.rowid <= a.rowid
    group by a.rowid")
  }

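On the data in the question, this runs faster than the sapply approach: about 1.6 times as fast by median time (roughly 27 ms versus 44 ms).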

library(microbenchmark)
library(dplyr)      # group_by, mutate, between, %>%
library(lubridate)  # days
microbenchmark(
  sqldf = sqldf("select a.*, count(b.date) as newvar
    from dat a left join dat b
    on a.id = b.id and b.date between a.date - 29 and a.date and b.rowid <= a.rowid
    group by a.rowid"),
  sapply = dat %>% 
    group_by(id) %>% 
    mutate(newvar = sapply(seq(length(date)), 
                         function(x) sum(between(date[1:x], date[x] - days(30), date[x]))))
)

giving:

Unit: milliseconds
   expr     min       lq     mean  median       uq      max neval cld
  sqldf 26.2768 26.77340 27.97039 27.0082 27.29515  63.1032   100  a 
 sapply 42.8800 43.69345 48.53094 44.1089 45.25275 285.4861   100   b

With sapply and between, count the observations up to and including the current one that fall within the preceding 30 days.

library(lubridate)
library(dplyr)
dat %>% 
  group_by(id) %>% 
  mutate(newvar = sapply(seq(length(date)), 
                         function(x) sum(between(date[1:x], date[x] - days(30), date[x]))))

# A tibble: 16 x 4
# Groups:   id [2]
   id        q date       newvar
   <chr> <dbl> <date>      <int>
 1 a         1 2021-01-01      1
 2 a         1 2021-01-01      2
 3 a         1 2021-01-21      3
 4 a         1 2021-01-21      4
 5 a         1 2021-02-12      3
 6 a         1 2021-02-12      4
 7 a         1 2021-02-12      5
 8 a         1 2021-02-12      6
 9 b         1 2021-02-02      1
10 b         1 2021-02-02      2
11 b         1 2021-02-22      3
12 b         1 2021-02-22      4
13 b         1 2021-03-13      3
14 b         1 2021-03-13      4
15 b         1 2021-03-13      5
16 b         1 2021-03-13      6
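
One detail worth checking on your own data: because between is inclusive at both ends, a lower bound of date[x] - days(30) covers 31 calendar days, whereas subtracting 29 covers 30. The question's data cannot tell the two apart, so here is a small base R sketch with two made-up dates exactly 30 days apart (d, n_30day and n_31day are illustrative names, not from the answers above):

```r
# Two hypothetical observations exactly 30 days apart
d <- as.Date(c("2021-01-01", "2021-01-31"))

# Count observations in the window ending at the second date
n_30day <- sum(d >= d[2] - 29 & d <= d[2])  # 30 day window: starts 2021-01-02
n_31day <- sum(d >= d[2] - 30 & d <= d[2])  # 31 day window: starts 2021-01-01

n_30day  # 1
n_31day  # 2
```

Which bound is right depends on whether "the last 30 days" is meant to include the day exactly 30 days earlier.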