Create a number for each consecutive sequence
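I am analyzing a time series in R. My goal is to number the consecutive sequences (runs) in the column "response". I want to add a column that classifies my data by those runs. Example: row 1 is group 1 for ID "A", row 3 is group 2 for ID "A", and rows 6 to 9 are group 3 for ID "A". The result I want is shown in "want_group". The data has the following structure: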
"row" "date" "id" "response" "want_group"
1 2021-10-06 "A" 1 1
2 2021-10-07 "A" 0 0
3 2021-10-08 "A" 1 2
4 2021-10-09 "A" 0 0
5 2021-10-10 "A" 0 0
6 2021-10-11 "A" 1 3
7 2021-10-12 "A" 1 3
8 2021-10-13 "A" 1 3
9 2021-10-14 "A" 1 3
10 2021-10-15 "A" 0 0
11 2021-10-16 "A" 1 4
12 2021-10-17 "A" 0 0
13 2021-10-18 "A" 0 0
14 2021-10-06 "B" 0 0
15 2021-10-07 "B" 0 0
16 2021-10-08 "B" 0 0
17 2021-10-09 "B" 1 1
18 2021-10-10 "B" 1 1
19 2021-10-11 "B" 0 0
20 2021-10-12 "B" 0 0
21 2021-10-13 "B" 0 0
22 2021-10-14 "B" 0 0
23 2021-10-15 "B" 0 0
24 2021-10-16 "B" 1 2
25 2021-10-17 "B" 1 2
26 2021-10-18 "B" 1 2
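My idea was to group the data frame and compute a cumulative sum of the variable response, giving a structure similar to "length of longest consecutive elements of sequence", so that I would get cs_res = 1 for row 3 and cs_res = 1, 2, 3, 4 for rows 6 to 9. But the cumsum is computed over the whole ID. I hope you can give me a hint about a function in R, or about how to find a solution.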
library(dplyr)

df1 <- data.frame(row = 1:13,
                  date = seq.Date(as.Date("2021-10-06"), as.Date("2021-10-18"), "day"),
                  id = rep("A", times = 13),
                  response = c(1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0),
                  want_group = c(1, 0, 2, 0, 0, 3, 3, 3, 3, 0, 4, 0, 0))
df2 <- data.frame(row = 14:26,
                  date = seq.Date(as.Date("2021-10-06"), as.Date("2021-10-18"), "day"),
                  id = rep("B", times = 13),
                  response = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1),
                  want_group = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 2, 2, 2))
df <- rbind(df1, df2)

# Attempt: the counter runs over all 1s of an id instead of restarting per run
df %>%
  group_by(id, response) %>%
  mutate(
    cs_res = if_else(response == 1L, sequence(rle(response)$lengths), 0L)
  )
"row" "id" "response" "cs_res"
1 "A" 1 1
2 "A" 0 0
3 "A" 1 2
4 "A" 0 0
5 "A" 0 0
6 "A" 1 3
7 "A" 1 4
8 "A" 1 5
9 "A" 1 6
10 "A" 0 0
11 "A" 1 7
12 "A" 0 0
13 "A" 0 0
14 "B" 0 0
15 "B" 0 0
.
.
.
Here is a rather hacky solution using dplyr and tidyr:
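library(dplyr)
library(tidyr)
df <- df %>%
  group_by(id) %>%
  mutate(lag_res = lag(response, default = 0),
         # first == 1 marks the first row of each run of 1s
         first = ifelse(lag_res == 0 & response == 1, 1, 0),
         # number the run starts, set 0 rows to 0, leave the rest of a run NA
         want_group = case_when(first == 1 ~ cumsum(first),
                                response == 0 ~ 0,
                                TRUE ~ NA_real_)) %>%
  fill(want_group) %>%  # carry the run number down through each run
  select(-lag_res, -first) %>%
  print(n = 26) %>%
  ungroup()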
# A tibble: 26 x 5
# Groups: id [2]
row date id response want_group
<int> <date> <chr> <dbl> <dbl>
1 1 2021-10-06 A 1 1
2 2 2021-10-07 A 0 0
3 3 2021-10-08 A 1 2
4 4 2021-10-09 A 0 0
5 5 2021-10-10 A 0 0
6 6 2021-10-11 A 1 3
7 7 2021-10-12 A 1 3
8 8 2021-10-13 A 1 3
9 9 2021-10-14 A 1 3
10 10 2021-10-15 A 0 0
11 11 2021-10-16 A 1 4
12 12 2021-10-17 A 0 0
13 13 2021-10-18 A 0 0
14 14 2021-10-06 B 0 0
15 15 2021-10-07 B 0 0
16 16 2021-10-08 B 0 0
17 17 2021-10-09 B 1 1
18 18 2021-10-10 B 1 1
19 19 2021-10-11 B 0 0
20 20 2021-10-12 B 0 0
21 21 2021-10-13 B 0 0
22 22 2021-10-14 B 0 0
23 23 2021-10-15 B 0 0
24 24 2021-10-16 B 1 2
25 25 2021-10-17 B 1 2
26 26 2021-10-18 B 1 2
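As a side note, the helper columns and fill() are not strictly required. Below is a minimal base R sketch of the same run numbering. It assumes that response only contains 0 and 1 and that the rows are already ordered by date within each id. The names toy and run_id are made up for this illustration and are not part of the original code.

```r
# Toy data: the response values for ids "A" and "B" from the question
toy <- data.frame(
  id = rep(c("A", "B"), each = 13),
  response = c(1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
               0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1)
)

# Hypothetical helper: number the runs of 1s in a 0/1 vector, 0 elsewhere
run_id <- function(x) {
  previous <- c(0, head(x, -1))        # value of the preceding row (0 before the first)
  starts <- x == 1 & previous == 0     # TRUE on the first row of each run of 1s
  ifelse(x == 1, cumsum(starts), 0)    # run number on 1 rows, 0 on 0 rows
}

# ave() applies run_id separately within each id
toy$want_group <- ave(toy$response, toy$id, FUN = run_id)

toy$want_group[toy$id == "A"]
# 1 0 2 0 0 3 3 3 3 0 4 0 0
```

Because the numbering restarts inside ave() for every id, no explicit ungrouping step is needed afterwards.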
Then, to get cs_res, you can do:
df %>%
  group_by(id, want_group) %>%
  mutate(cs_res = cumsum(response))
# A tibble: 26 x 6
# Groups: id, want_group [8]
row date id response want_group cs_res
<int> <date> <chr> <dbl> <dbl> <dbl>
1 1 2021-10-06 A 1 1 1
2 2 2021-10-07 A 0 0 0
3 3 2021-10-08 A 1 2 1
4 4 2021-10-09 A 0 0 0
5 5 2021-10-10 A 0 0 0
6 6 2021-10-11 A 1 3 1
7 7 2021-10-12 A 1 3 2
8 8 2021-10-13 A 1 3 3
9 9 2021-10-14 A 1 3 4
10 10 2021-10-15 A 0 0 0
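The sequence(rle(...)) idea from the question can also produce cs_res directly, provided the data is split by id only and not by id and response: grouping by response puts all 1s of an id into a single run, which is why the counter in the question kept increasing. The following is a minimal base R sketch, assuming response only contains 0 and 1 and the rows are ordered by date within each id. The names toy and run_counter are hypothetical and chosen only for this example.

```r
# Toy data: the response values for ids "A" and "B" from the question
toy <- data.frame(
  id = rep(c("A", "B"), each = 13),
  response = c(1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
               0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1)
)

# Hypothetical helper: position within each run of 1s, 0 on 0 rows
run_counter <- function(x) {
  runs <- rle(x)                             # run lengths of consecutive equal values
  ifelse(x == 1, sequence(runs$lengths), 0)  # count 1, 2, ... inside runs of 1s only
}

# ave() applies run_counter separately within each id
toy$cs_res <- ave(toy$response, toy$id, FUN = run_counter)

toy$cs_res[toy$id == "A"]
# 1 0 1 0 0 1 2 3 4 0 1 0 0
```

This skips the intermediate want_group column, so it is only a shortcut when the run number itself is not needed.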