Min and max of NA sequences
I have a data frame in which the foo column contains runs of NA values. For example:
> test
id foo time
1 1 <NA> 2018-11-19 00:00:48
2 1 <NA> 2018-11-19 00:10:51
3 1 <NA> 2018-11-19 00:21:15
4 1 <NA> 2018-11-19 00:31:02
5 1 x 2018-11-19 00:40:59
6 1 x 2018-11-19 00:50:49
7 1 x 2018-11-19 01:01:15
8 1 <NA> 2018-11-19 01:11:07
9 1 <NA> 2018-11-19 01:20:49
10 2 <NA> 2018-11-19 01:30:50
11 2 <NA> 2018-11-19 01:40:43
12 2 x 2018-11-19 01:50:46
13 2 x 2018-11-19 02:01:02
14 2 x 2018-11-19 02:10:44
15 2 <NA> 2018-11-19 02:20:51
16 2 <NA> 2018-11-19 02:31:06
17 2 <NA> 2018-11-19 02:40:42
18 2 <NA> 2018-11-19 02:50:45
19 3 <NA> 2018-11-19 03:01:00
20 3 <NA> 2018-11-19 03:10:42
21 3 <NA> 2018-11-19 03:21:10
22 3 <NA> 2018-11-19 03:31:10
23 3 x 2018-11-19 03:40:44
24 3 <NA> 2018-11-19 03:50:46
25 3 x 2018-11-19 04:00:46
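My objective is to flag where each of these runs starts and ends in terms of id and time: the dataset above would get an additional column called index that flags where the NA values start and end. However, the last NA in an id series should be ignored, and a single NA value should be labelled "both". For example: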
> test
id foo time index
1 1 <NA> 2018-11-19 00:00:48 na_starts
2 1 <NA> 2018-11-19 00:10:51
3 1 <NA> 2018-11-19 00:21:15
4 1 <NA> 2018-11-19 00:31:02 na_ends
5 1 x 2018-11-19 00:40:59
6 1 x 2018-11-19 00:50:49
7 1 x 2018-11-19 01:01:15
8 1 <NA> 2018-11-19 01:11:07 na_starts
9 1 <NA> 2018-11-19 01:20:49
10 2 <NA> 2018-11-19 01:30:50 na_starts
11 2 <NA> 2018-11-19 01:40:43 na_ends
12 2 x 2018-11-19 01:50:46
13 2 x 2018-11-19 02:01:02
14 2 x 2018-11-19 02:10:44
15 2 <NA> 2018-11-19 02:20:51 na_starts
16 2 <NA> 2018-11-19 02:31:06
17 2 <NA> 2018-11-19 02:40:42
18 2 <NA> 2018-11-19 02:50:45
19 3 <NA> 2018-11-19 03:01:00
20 3 <NA> 2018-11-19 03:10:42 na_starts
21 3 <NA> 2018-11-19 03:21:10
22 3 <NA> 2018-11-19 03:31:10 na_ends
23 3 x 2018-11-19 03:40:44
24 3 <NA> 2018-11-19 03:50:46 both
25 3 x 2018-11-19 04:00:46
How can I achieve this using rle or a similar function in R?
dput(test)
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,
2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3), foo = c(NA, NA, NA, NA,
"x", "x", "x", NA, NA, NA, NA, "x", "x", "x", NA, NA, NA, NA,
NA, NA, NA, NA, "x", NA, "x"), time = structure(c(1542585648,
1542586251, 1542586875, 1542587462, 1542588059, 1542588649, 1542589275,
1542589867, 1542590449, 1542591050, 1542591643, 1542592246, 1542592862,
1542593444, 1542594051, 1542594666, 1542595242, 1542595845, 1542596460,
1542597042, 1542597670, 1542598270, 1542598844, 1542599446, 1542600046
), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA,
-25L), class = "data.frame")
Maybe this will work? I'm not entirely sure what time has to do with the problem, except that I assume you want the data sorted by id and time.
library("tidyverse")
test = test %>%
  arrange(id, time) %>%
  mutate(miss = is.na(foo))

# This will make the index column for a single run
mark_ends = function(n, miss){
  if(!miss){
    rep("", times = n)
  }
  else{
    if(n == 1){"both"}
    else(c("na_starts", rep("", times = (n-2)), "na_ends"))
  }
}

# This will use mark_ends across a single ID
mark_index = function(id){
  runs = test$miss[test$id == id] %>%
    rle
  result = Map(f = mark_ends, n = runs$lengths, miss = runs$values) %>%
    reduce(.f = c)
  # the last row of each id is never flagged
  result[length(result)] = ""
  result
}

# use the function on each id, combine, and put it in test
test$index = unique(test$id) %>%
  map(mark_index) %>%
  reduce(.f = c)
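To see what the per-id helper receives, here is a minimal base-R sketch of rle() applied to the missingness pattern of id 1. The vector foo_id1 is typed in by hand from the question's data purely for illustration.

```r
# foo values for id == 1, copied by hand from the question's data
foo_id1 <- c(NA, NA, NA, NA, "x", "x", "x", NA, NA)

# rle() collapses consecutive equal values into (length, value) pairs
runs <- rle(is.na(foo_id1))

runs$lengths  # 4 3 2
runs$values   # TRUE FALSE TRUE
```

Each TRUE entry is one run of NA values, and its length decides whether the run gets a single "both" label or a "na_starts"/"na_ends" pair.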
Using tidyverse and data.table you can do:
library(tidyverse)
library(data.table)  # for rleid()

# df is the data frame from the question
df %>%
  rowid_to_column() %>%
  group_by(id, temp = rleid(foo)) %>%
  mutate(temp2 = seq_along(temp),
         index = ifelse(is.na(foo) & temp2 == min(temp2) & temp2 == max(temp2), "both",
                 ifelse(is.na(foo) & temp2 == min(temp2), "na_starts",
                 ifelse(is.na(foo) & temp2 == max(temp2), "na_ends", NA)))) %>%
  group_by(id) %>%
  mutate(index = ifelse(rowid == max(rowid[is.na(foo) & max(temp) & max(temp2)]) &
                          is.na(lag(foo)), NA, index)) %>%
  select(-temp, -temp2, -rowid)
id foo time index
<dbl> <chr> <dttm> <chr>
1 1. <NA> 2018-11-19 00:00:48 na_starts
2 1. <NA> 2018-11-19 00:10:51 <NA>
3 1. <NA> 2018-11-19 00:21:15 <NA>
4 1. <NA> 2018-11-19 00:31:02 na_ends
5 1. x 2018-11-19 00:40:59 <NA>
6 1. x 2018-11-19 00:50:49 <NA>
7 1. x 2018-11-19 01:01:15 <NA>
8 1. <NA> 2018-11-19 01:11:07 na_starts
9 1. <NA> 2018-11-19 01:20:49 <NA>
10 2. <NA> 2018-11-19 01:30:50 na_starts
11 2. <NA> 2018-11-19 01:40:43 na_ends
12 2. x 2018-11-19 01:50:46 <NA>
13 2. x 2018-11-19 02:01:02 <NA>
14 2. x 2018-11-19 02:10:44 <NA>
15 2. <NA> 2018-11-19 02:20:51 na_starts
16 2. <NA> 2018-11-19 02:31:06 <NA>
17 2. <NA> 2018-11-19 02:40:42 <NA>
18 2. <NA> 2018-11-19 02:50:45 <NA>
19 3. <NA> 2018-11-19 03:01:00 na_starts
20 3. <NA> 2018-11-19 03:10:42 <NA>
21 3. <NA> 2018-11-19 03:21:10 <NA>
22 3. <NA> 2018-11-19 03:31:10 na_ends
23 3. x 2018-11-19 03:40:44 <NA>
24 3. <NA> 2018-11-19 03:50:46 both
25 3. x 2018-11-19 04:00:46 <NA>
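First, it creates a unique row ID. Second, it groups by "id" and the run-length ID of "foo". Third, it creates a sequence that numbers the rows within each run of "foo". Fourth, it creates the "index" variable using the given conditions. Then it groups by "id" and assigns NA to the last row, per id, of a sequence of missing "foo" values. Finally, it removes the redundant variables.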
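The grouping and numbering steps can be sketched in base R on a toy vector. This is an illustration only, not the code above: it derives the run ID from is.na(foo) with cumsum() instead of rleid(foo), the names miss, run_id, pos and len are made up for the example, and it leaves out the final per-id step that blanks the trailing NA.

```r
# toy foo column, invented for this illustration
foo <- c(NA, NA, "x", "x", "x", NA, "x")

miss <- is.na(foo)
# run ID: goes up by one whenever the missingness flag changes
run_id <- cumsum(c(TRUE, miss[-1] != miss[-length(miss)]))
# position of each row inside its run (the role temp2 plays above)
pos <- ave(seq_along(miss), run_id, FUN = seq_along)
# length of the run, repeated for every row in it
len <- ave(seq_along(miss), run_id, FUN = length)

# same labelling conditions as the nested ifelse() above
index <- ifelse(!miss, NA,
         ifelse(len == 1, "both",
         ifelse(pos == 1, "na_starts",
         ifelse(pos == len, "na_ends", NA))))
index
```

Here pos == 1 plays the role of temp2 == min(temp2), and pos == len the role of temp2 == max(temp2).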
A possible solution using data.table:
library(data.table)
setDT(test)

# row indices of the first and last row of every all-NA run, per id
ind <- test[, .(ri = unique(.I[c(1,.N)][all(is.na(foo))]))
            , by = .(id, rl = rleid(is.na(foo)))
            ][, index := list("both",c("na_starts","na_ends"))[[1 + (.N > 1)]]
              , by = .(id, rl)][]

# write the labels back, then reset the last row of every id to NA
test[ind$ri, index := ind$index
     ][test[, .I[.N], by = id]$V1, index := NA][]
which gives:
> test
id foo time index
1: 1 <NA> 2018-11-19 00:00:48 na_starts
2: 1 <NA> 2018-11-19 00:10:51 <NA>
3: 1 <NA> 2018-11-19 00:21:15 <NA>
4: 1 <NA> 2018-11-19 00:31:02 na_ends
5: 1 x 2018-11-19 00:40:59 <NA>
6: 1 x 2018-11-19 00:50:49 <NA>
7: 1 x 2018-11-19 01:01:15 <NA>
8: 1 <NA> 2018-11-19 01:11:07 na_starts
9: 1 <NA> 2018-11-19 01:20:49 <NA>
10: 2 <NA> 2018-11-19 01:30:50 na_starts
11: 2 <NA> 2018-11-19 01:40:43 na_ends
12: 2 x 2018-11-19 01:50:46 <NA>
13: 2 x 2018-11-19 02:01:02 <NA>
14: 2 x 2018-11-19 02:10:44 <NA>
15: 2 <NA> 2018-11-19 02:20:51 na_starts
16: 2 <NA> 2018-11-19 02:31:06 <NA>
17: 2 <NA> 2018-11-19 02:40:42 <NA>
18: 2 <NA> 2018-11-19 02:50:45 <NA>
19: 3 <NA> 2018-11-19 03:01:00 na_starts
20: 3 <NA> 2018-11-19 03:10:42 <NA>
21: 3 <NA> 2018-11-19 03:21:10 <NA>
22: 3 <NA> 2018-11-19 03:31:10 na_ends
23: 3 x 2018-11-19 03:40:44 <NA>
24: 3 <NA> 2018-11-19 03:50:46 both
25: 3 x 2018-11-19 04:00:46 <NA>
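For comparison, the same labelling can be sketched with base R only, using rle() as the question asked. This is one possible arrangement under stated assumptions rather than any answerer's method: label_runs is a hypothetical helper name, the data frame below is a compact copy of the question's id and foo columns with time left out, and the rows are assumed to be already sorted by id and time.

```r
# compact copy of the question's id and foo columns (time omitted)
test <- data.frame(
  id  = rep(1:3, times = c(9, 9, 7)),
  foo = c(NA, NA, NA, NA, "x", "x", "x", NA, NA,
          NA, NA, "x", "x", "x", NA, NA, NA, NA,
          NA, NA, NA, NA, "x", NA, "x"),
  stringsAsFactors = FALSE
)

# label_runs (hypothetical helper): labels the foo vector of one id
label_runs <- function(foo) {
  r <- rle(is.na(foo))
  out <- unlist(Map(function(n, miss) {
    if (!miss) {
      rep(NA_character_, n)           # non-missing run: no label
    } else if (n == 1) {
      "both"                          # single NA
    } else {
      c("na_starts", rep(NA_character_, n - 2), "na_ends")
    }
  }, r$lengths, r$values))
  out[length(out)] <- NA              # the last row of an id is never flagged
  out
}

# apply the helper within each id and write the labels back in place
test$index <- ave(test$foo, test$id, FUN = label_runs)
test
```

On this data the index column should match the output shown above, with NA rather than an empty string on unlabelled rows.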