Create number of consecutive sequences

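I am analysing a time series in R. My goal is to count consecutive sequences based on "response". I want to add a column that classifies my data by the consecutive sequences in the column response. Example: row 1 is group 1 for id "A", row 3 is group 2 for id "A", and rows 6 to 9 are group 3 for id "A". The result I want is shown in "want_group". The data has the following structure: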

"row"   "date"  "id"    "response"  "want_group"
1   2021-10-06  "A" 1   1
2   2021-10-07  "A" 0   0
3   2021-10-08  "A" 1   2
4   2021-10-09  "A" 0   0
5   2021-10-10  "A" 0   0
6   2021-10-11  "A" 1   3
7   2021-10-12  "A" 1   3
8   2021-10-13  "A" 1   3
9   2021-10-14  "A" 1   3
10  2021-10-15  "A" 0   0
11  2021-10-16  "A" 1   4
12  2021-10-17  "A" 0   0
13  2021-10-18  "A" 0   0
14  2021-10-06  "B" 0   0
15  2021-10-07  "B" 0   0
16  2021-10-08  "B" 0   0
17  2021-10-09  "B" 1   1
18  2021-10-10  "B" 1   1
19  2021-10-11  "B" 0   0
20  2021-10-12  "B" 0   0
21  2021-10-13  "B" 0   0
22  2021-10-14  "B" 0   0
23  2021-10-15  "B" 0   0
24  2021-10-16  "B" 1   2
25  2021-10-17  "B" 1   2
26  2021-10-18  "B" 1   2

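My idea was to group the data frame and compute the cumulative sum of the variable response, giving it a structure similar to the one in "length of longest consecutive elements of sequence", so that I would get cs_res = 1 for row 3 and cs_res = 1, 2, 3, 4 for rows 6 to 9. But the cumsum is computed over the whole id. I hope you can give me a hint towards a function in R, or towards how I can find a solution.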

df1 <- data.frame(row = c(1:13),
                  date = seq.Date(as.Date("2021-10-06"), as.Date("2021-10-18"), "day"),
                  id = rep("A", times = 13),
                  response = c(1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0),
                  want_group = c(1, 0, 2, 0, 0, 3, 3, 3, 3, 0, 4, 0, 0) )
df2 <- data.frame(row = c(14:26),
                  date = seq.Date(as.Date("2021-10-06"), as.Date("2021-10-18"), "day"),
                  id = rep("B", times = 13),
                  response = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1),
                  want_group = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 2, 2, 2) ) 

df <- rbind(df1, df2)

library(dplyr)

df %>% 
  group_by(id, response) %>% 
  mutate(
    cs_res = if_else(response == 1L, sequence(rle(response)$lengths), 0L)
  )

"row"   "id"    "response"  "cs_res"
1   "A" 1   1
2   "A" 0   0
3   "A" 1   2
4   "A" 0   0
5   "A" 0   0
6   "A" 1   3
7   "A" 1   4
8   "A" 1   5
9   "A" 1   6
10  "A" 0   0
11  "A" 1   7
12  "A" 0   0
13  "A" 0   0
14  "B" 0   0
15  "B" 0   0
.
.
.
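A likely reason the counter above runs from 1 to 7 for id "A": grouping by response as well as id means rle() only ever sees the rows with response == 1, which form a single run of length 7. The following base R sketch reproduces that on the response vector for id "A" and contrasts it with calling rle() on the full vector. The variable names are illustrative and not part of the original post.

```r
# Response vector for id "A", copied from the example data
x <- c(1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0)

# What group_by(id, response) effectively hands to rle(): only the 1s,
# which form a single run of length 7
ones_only <- x[x == 1]
counter_grouped <- sequence(rle(ones_only)$lengths)

# Running rle() on the full vector keeps the 0s as separators,
# so the counter restarts at the beginning of every run
runs <- rle(x)
counter_full <- sequence(runs$lengths)

# Keep the counter only where response is 1
cs_res <- ifelse(x == 1, counter_full, 0L)

print(counter_grouped)
print(cs_res)
```

Under this reading, grouping by id alone and masking the 0 rows afterwards would give a counter that restarts with each run.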

Here is a very hacky solution using dplyr and tidyr:

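library(dplyr)
library(tidyr)

df <- df %>%
  group_by(id) %>%
  mutate(lag_res = lag(response, default = 0),
         # first == 1 marks the row where a run of 1s starts
         first = ifelse(lag_res == 0 & response == 1, 1, 0),
         # number each run at its first row, 0 where response is 0, NA otherwise
         want_group = case_when(first == 1 ~ cumsum(first),
                                response == 0 ~ 0,
                                TRUE ~ NA_real_)) %>%
  # carry each run number down over the NA rows of the same run
  fill(want_group) %>%
  select(-lag_res, -first) %>%
  print(n = 26) %>%
  ungroup()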

# A tibble: 26 x 5
# Groups:   id [2]
     row date       id    response want_group
   <int> <date>     <chr>    <dbl>      <dbl>
 1     1 2021-10-06 A            1          1
 2     2 2021-10-07 A            0          0
 3     3 2021-10-08 A            1          2
 4     4 2021-10-09 A            0          0
 5     5 2021-10-10 A            0          0
 6     6 2021-10-11 A            1          3
 7     7 2021-10-12 A            1          3
 8     8 2021-10-13 A            1          3
 9     9 2021-10-14 A            1          3
10    10 2021-10-15 A            0          0
11    11 2021-10-16 A            1          4
12    12 2021-10-17 A            0          0
13    13 2021-10-18 A            0          0
14    14 2021-10-06 B            0          0
15    15 2021-10-07 B            0          0
16    16 2021-10-08 B            0          0
17    17 2021-10-09 B            1          1
18    18 2021-10-10 B            1          1
19    19 2021-10-11 B            0          0
20    20 2021-10-12 B            0          0
21    21 2021-10-13 B            0          0
22    22 2021-10-14 B            0          0
23    23 2021-10-15 B            0          0
24    24 2021-10-16 B            1          2
25    25 2021-10-17 B            1          2
26    26 2021-10-18 B            1          2
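The same run numbering can also be sketched without the intermediate columns, by taking the cumulative sum of the run starts and zeroing the rows where response is 0. This is a base R variant rather than the dplyr/tidyr code above. The helper label_runs and the toy data frame are hypothetical names introduced for illustration, and the sketch assumes the rows are already ordered by date within each id.

```r
# Hypothetical helper: number each run of 1s, leave 0 rows as 0
label_runs <- function(response) {
  # Value of the previous row, with 0 assumed before the first row
  previous <- c(0, head(response, -1))
  # TRUE where a 1 follows a 0, i.e. where a new run starts
  run_start <- response == 1 & previous == 0
  # cumsum() counts the runs started so far; the product zeroes the 0 rows
  cumsum(run_start) * (response == 1)
}

# Response vectors copied from the example data
a <- c(1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0)
b <- c(0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1)
toy <- data.frame(id = rep(c("A", "B"), each = 13), response = c(a, b))

# ave() applies the helper within each id, similar to group_by() + mutate()
toy$want_group <- ave(toy$response, toy$id, FUN = label_runs)
print(toy)
```

The arithmetic inside label_runs should also work as a single expression inside mutate() after group_by(id), with lag(response, default = 0) in place of the manual shift.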

Then, to get cs_res, you can do:

df %>% group_by(id,want_group) %>% 
   mutate(cs_res = cumsum(response))
# A tibble: 26 x 6
# Groups:   id, want_group [8]
     row date       id    response want_group cs_res
   <int> <date>     <chr>    <dbl>      <dbl>  <dbl>
 1     1 2021-10-06 A            1          1      1
 2     2 2021-10-07 A            0          0      0
 3     3 2021-10-08 A            1          2      1
 4     4 2021-10-09 A            0          0      0
 5     5 2021-10-10 A            0          0      0
 6     6 2021-10-11 A            1          3      1
 7     7 2021-10-12 A            1          3      2
 8     8 2021-10-13 A            1          3      3
 9     9 2021-10-14 A            1          3      4
10    10 2021-10-15 A            0          0      0