How can I fill in gaps in a data frame by group in R?
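I have a data frame containing a time series, a set of events, and an ID. I joined two data frames on a shared time column, but I want to fill in the gaps in the events.
Here is a simplified example with the value column I used to join the data frames, followed by the event and ID columns. I would like to fill in the event and ID columns between the first (start) and last (end) entry of each event.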
a <- c(seq(from = 150, to = 213, by = 3))
b <- c("start", "", "", "mid", "", "end", "", "",
       "start", "", "", "end", "", "", "",
       "start", "", "end", "start", "mid", "", "end")
c <- c("A", "", "", "A", "", "A", "", "",
       "A", "", "", "A", "", "", "",
       "B", "", "B", "B", "B", "", "B")
(data <- data.frame(value = a, event = b, ID = c))
Here is the goal, with event and ID filled in between start and end:
agoal <- c(seq(from = 150, to = 213, by = 3))
bgoal <- c("start", "start", "start", "mid", "mid", "end", "", "",
           "start", "start", "start", "end", "", "", "",
           "start", "start", "end", "start", "mid", "mid", "end")
cgoal <- c("A", "A", "A", "A", "A", "A", "", "",
           "A", "A", "A", "A", "", "", "",
           "B", "B", "B", "B", "B", "B", "B")
(goal <- data.frame(value = agoal, event = bgoal, ID = cgoal))
You can use dplyr and tidyr for this task:
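library(tidyr)
library(dplyr)

data %>%
  # grp is 1 from each "start" up to the row before its "end", and 0 elsewhere
  mutate(grp = cumsum(case_when(event == "end"   ~ -1,
                                event == "start" ~  1,
                                TRUE             ~  0)),
         # inside an open event, turn blanks into NA so fill() can replace them
         across(c(-value, -grp), ~ ifelse(.x == "" & grp == 1, NA_character_, .x))) %>%
  fill(c(-value), .direction = "down") %>%
  select(-grp)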
This returns:
value event ID
1 150 start A
2 153 start A
3 156 start A
4 159 mid A
5 162 mid A
6 165 end A
7 168
8 171
9 174 start A
10 177 start A
11 180 start A
12 183 end A
13 186
14 189
15 192
16 195 start B
17 198 start B
18 201 end B
19 204 start B
20 207 mid B
21 210 mid B
22 213 end B
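To make the role of the grp counter in the pipeline above more concrete, here is a minimal base R sketch that rebuilds only that counter. The names step and to_fill are made up for illustration and are not part of the answer; the event vector is copied from the question.

```r
# Base R sketch of the counter; `event` is the event column from the question.
event <- c("start", "", "", "mid", "", "end", "", "",
           "start", "", "", "end", "", "", "",
           "start", "", "end", "start", "mid", "", "end")

# +1 opens an event, -1 closes it, anything else leaves the counter unchanged
step <- ifelse(event == "start", 1, ifelse(event == "end", -1, 0))
grp  <- cumsum(step)

# Blank rows that sit inside an open event (grp == 1) are the ones to fill
to_fill <- which(event == "" & grp == 1)

print(grp)
print(to_fill)
```

The blank rows after an "end" have grp equal to 0, so they are never turned into NA and stay blank. Note that this relies on "start" and "end" strictly alternating: two "start" rows without an "end" in between would push the counter to 2, and the grp == 1 test would no longer match.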
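Here is another option with data.table:
- Convert the data.frame to a data.table with setDT
- Loop over the columns specified in .SDcols ('event', 'ID')
- Replace the blanks ("") with NA using na_if
- Use na.locf0 (from zoo) to fill the NA elements with the previous non-NA value and assign (:=) the result back to the columns
- Get the row index (.I) of the rows where 'event' is duplicated and equal to "end", grouped by the run-length ID (rleid) of 'event'
- Extract the row index ($V1) and set 'event' and 'ID' back to blank for those rows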
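library(data.table)
library(zoo)
library(dplyr)

# blanks -> NA, then carry the last non-NA value forward in both columns
setDT(data)[, c("event", "ID") := lapply(.SD, function(x)
  na.locf0(na_if(x, ""))), .SDcols = event:ID]

# the fill also ran past each "end"; blank those repeated "end" rows again
data[data[, .I[duplicated(event) & event == "end"],
          rleid(event)]$V1, c("event", "ID") := .("", "")]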
Output:
data
value event ID
1: 150 start A
2: 153 start A
3: 156 start A
4: 159 mid A
5: 162 mid A
6: 165 end A
7: 168
8: 171
9: 174 start A
10: 177 start A
11: 180 start A
12: 183 end A
13: 186
14: 189
15: 192
16: 195 start B
17: 198 start B
18: 201 end B
19: 204 start B
20: 207 mid B
21: 210 mid B
22: 213 end B
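The two statements of the data.table approach can be followed step by step on a smaller input. The sketch below uses only the first eight events from the question; the names dt and idx are made up for illustration.

```r
library(data.table)
library(zoo)
library(dplyr)

# First eight events from the question: one start/mid/end block plus two blanks
dt <- data.table(event = c("start", "", "", "mid", "", "end", "", ""))

# Step 1: blanks -> NA, then carry the last non-NA value forward.
# This also overwrites the two trailing blanks with "end".
dt[, event := na.locf0(na_if(event, ""))]
print(dt$event)

# Step 2: within each run of identical values, every "end" after the first
# one was produced by the fill, so collect those row numbers
idx <- dt[, .I[duplicated(event) & event == "end"], by = rleid(event)]$V1
print(idx)

# Step 3: set those rows back to blank
dt[idx, event := ""]
print(dt$event)
```

Grouping by rleid(event) is what keeps the first "end" of each run: duplicated() is evaluated per run, so only the copies that follow it are selected.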