相差至少 30 分钟的子集观察
Subset observations that differ by at least 30 minutes time
我有一个 data.table
(约 3000 万行),其中包含一个 POSIXct
格式的 datetime
列、一个 id
列和其他一些列(在例如,我只留下了一个不相关的列 x
来证明还有其他列需要保留)。一个dput
在post的底部。
head(DT)
# datetime x id
#1: 2016-04-28 16:20:18 0.02461368 1
#2: 2016-04-28 16:41:34 0.88953932 1
#3: 2016-04-28 16:46:07 0.31818101 1
#4: 2016-04-28 17:00:56 0.14711365 1
#5: 2016-04-28 17:09:11 0.54406602 1
#6: 2016-04-28 17:39:09 0.69280341 1
问:对于每个 id
,我只需要对时间相差超过 30 分钟的观察结果进行子集化。执行此操作的有效 data.table
方法是什么(如果可能,无需大量循环)?
逻辑也可以描述为(就像我下面的评论):
Per id the first row is always kept. The next row that is at least 30
minutes after the first shall also be kept. Let's assume that row to
be kept is row 4. Then, compute time differences between row 4 and
rows 5:n and keep the first that differs by more than 30 mins and so
on
在下面的 dput 中,我添加了一个列 keep
以指示在此示例中应保留哪些行,因为它们与每个 id 保留的先前观察结果相差超过 30 分钟。困难在于,似乎需要迭代计算时间差(或者至少,我目前想不出更有效的方法)。
library(data.table)
DT <- structure(list(
datetime = structure(c(1461853218.81561, 1461854494.81561,
1461854767.81561, 1461855656.81561, 1461856151.81561, 1461857949.81561,
1461858601.81561, 1461858706.81561, 1461859078.81561, 1461859103.81561,
1461852799.81561, 1461852824.81561, 1461854204.81561, 1461855331.81561,
1461855633.81561, 1461856311.81561, 1461856454.81561, 1461857177.81561,
1461858662.81561, 1461858996.81561), class = c("POSIXct", "POSIXt")),
x = c(0.0246136845089495, 0.889539316063747, 0.318181007634848,
0.147113647311926, 0.544066024711356, 0.6928034061566, 0.994269776623696,
0.477795971091837, 0.231625785352662, 0.963024232536554, 0.216407935833558,
0.708530468167737, 0.758459537522867, 0.640506813768297, 0.902299045119435,
0.28915973729454, 0.795467417687178, 0.690705278422683, 0.59414202044718,
0.655705799115822),
id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
keep = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)),
.Names = c("datetime", "x", "id", "keep"),
row.names = c(NA, -20L),
class = c("data.table", "data.frame"))
setkey(DT, id, datetime)
DT[, difftime := difftime(datetime, shift(datetime, 1L, NA,type="lag"), units = "mins"),
by = id]
DT[is.na(difftime), difftime := 0]
DT[, difftime := cumsum(as.numeric(difftime)), by = id]
keep
列的解释:
- 第 2:3 行与第 1 行相差不到 30 分钟 -> 删除
- 第 4 行与第 1 行相差超过 30 分钟 -> 保持
- 第 5 行距离第 4 行不到 30 分钟 -> 删除
- 第 6 行与第 4 行相差超过 30 分钟 -> 保持
- ...
期望的输出:
desiredDT <- DT[(keep)]
感谢三位专家的解答。我在 1 和 1000 万行数据上测试了它们。这是基准测试的摘录。
a) 100 万行
microbenchmark(frank(DT_Frank), roland(DT_Roland), eddi1(DT_Eddi1), eddi2(DT_Eddi2),
times = 3L, unit = "relative")
#Unit: relative
# expr min lq mean median uq max neval
# frank(DT_Frank) 1.286647 1.277104 1.185216 1.267769 1.140614 1.036749 3
# roland(DT_Roland) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 3
# eddi1(DT_Eddi1) 11.748622 11.697409 10.941792 11.647320 10.587002 9.720901 3
# eddi2(DT_Eddi2) 9.966078 9.915651 9.210168 9.866330 8.877769 8.070281 3
b) 1000 万行
microbenchmark(frank(DT_Frank), roland(DT_Roland), eddi1(DT_Eddi1), eddi2(DT_Eddi2),
times = 3L, unit = "relative")
#Unit: relative
# expr min lq mean median uq max neval
# frank(DT_Frank) 1.019561 1.025427 1.026681 1.031061 1.030028 1.029037 3
# roland(DT_Roland) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 3
# eddi1(DT_Eddi1) 11.567302 11.443146 11.301487 11.323914 11.176515 11.035143 3
# eddi2(DT_Eddi2) 9.796800 9.693823 9.526193 9.594931 9.398969 9.211019 3
显然,@Frank 的 data.table 方法和@Roland 基于 Rcpp 的解决方案在性能上相似,Rcpp 略有优势,而 @eddi 的方法仍然很快,但性能不如其他方法。
然而,当我检查解决方案是否相等时,我发现@Roland 的方法与其他方法的结果略有不同:
a) 100 万行
all.equal(frank(DT_Frank), roland(DT_Roland))
#[1] "Component “datetime”: Numeric: lengths (982228, 982224) differ"
#[2] "Component “id”: Numeric: lengths (982228, 982224) differ"
#[3] "Component “x”: Numeric: lengths (982228, 982224) differ"
all.equal(frank(DT_Frank), eddi1(DT_Eddi1))
#[1] TRUE
all.equal(frank(DT_Frank), eddi2(DT_Eddi2))
#[1] TRUE
b) 1000 万行
all.equal(frank(DT_Frank), roland(DT_Roland))
#[1] "Component “datetime”: Numeric: lengths (9981898, 9981891) differ"
#[2] "Component “id”: Numeric: lengths (9981898, 9981891) differ"
#[3] "Component “x”: Numeric: lengths (9981898, 9981891) differ"
all.equal(frank(DT_Frank), eddi1(DT_Eddi1))
#[1] TRUE
all.equal(frank(DT_Frank), eddi2(DT_Eddi2))
#[1] TRUE
我目前的假设是,这种差异可能与差异是 > 30 分钟还是 >= 30 分钟有关,尽管我还不确定。
最后的想法:我决定采用@Frank 的解决方案有两个原因:1. 它的性能非常好,几乎与 Rcpp 解决方案相当,并且 2. 它不需要我不使用的另一个包还很熟悉(反正我用的是data.table)
我会这样做:
setDT(DT, key=c("id","datetime")) # invalid selfref with the OP's example data
s = 0L
w = DT[, .I[1L], by=id]$V1
while (length(w)){
s = s + 1L
DT[w, tag := s]
m = DT[w, .(id, datetime = datetime+30*60)]
w = DT[m, which = TRUE, roll=-Inf]
w = w[!is.na(w)]
}
这给出了
datetime x id keep tag
1: 2016-04-28 10:20:18 0.02461368 1 TRUE 1
2: 2016-04-28 10:41:34 0.88953932 1 FALSE NA
3: 2016-04-28 10:46:07 0.31818101 1 FALSE NA
4: 2016-04-28 11:00:56 0.14711365 1 TRUE 2
5: 2016-04-28 11:09:11 0.54406602 1 FALSE NA
6: 2016-04-28 11:39:09 0.69280341 1 TRUE 3
7: 2016-04-28 11:50:01 0.99426978 1 FALSE NA
8: 2016-04-28 11:51:46 0.47779597 1 FALSE NA
9: 2016-04-28 11:57:58 0.23162579 1 FALSE NA
10: 2016-04-28 11:58:23 0.96302423 1 FALSE NA
11: 2016-04-28 10:13:19 0.21640794 2 TRUE 1
12: 2016-04-28 10:13:44 0.70853047 2 FALSE NA
13: 2016-04-28 10:36:44 0.75845954 2 FALSE NA
14: 2016-04-28 10:55:31 0.64050681 2 TRUE 2
15: 2016-04-28 11:00:33 0.90229905 2 FALSE NA
16: 2016-04-28 11:11:51 0.28915974 2 FALSE NA
17: 2016-04-28 11:14:14 0.79546742 2 FALSE NA
18: 2016-04-28 11:26:17 0.69070528 2 TRUE 3
19: 2016-04-28 11:51:02 0.59414202 2 FALSE NA
20: 2016-04-28 11:56:36 0.65570580 2 TRUE 4
其背后的想法由 OP :
描述
per id the first row is always kept. The next row that is at least 30 minutes after the first shall also be kept. Let's assume that row to be kept is row 4. Then, compute time differences between row 4 and rows 5:n and keep the first that differs by more than 30 mins and so on
使用 Rcpp:
library(Rcpp)
library(inline)
cppFunction(
'LogicalVector selecttimes(const NumericVector x) {
const int n = x.length();
LogicalVector res(n);
res(0) = true;
double testval = x(0);
for (int i=1; i<n; i++) {
if (x(i) - testval > 30 * 60) {
testval = x(i);
res(i) = true;
}
}
return res;
}')
DT[, keep1 := selecttimes(datetime), by = id]
DT[, all(keep == keep1)]
#[1] TRUE
需要做一些额外的测试,需要输入验证,时间差可以作为参数。
# create an index column
DT[, idx := 1:.N, by = id]
# find the indices of the matching future dates
DT[, fut.idx := DT[.(id = id, datetime = datetime+30*60), on = c('id', 'datetime')
, idx, roll = -Inf]]
# datetime x id keep difftime idx fut.idx
# 1: 2016-04-28 09:20:18 0.02461368 1 TRUE 0.0000000 mins 1 4
# 2: 2016-04-28 09:41:34 0.88953932 1 FALSE 21.2666667 mins 2 6
# 3: 2016-04-28 09:46:07 0.31818101 1 FALSE 25.8166667 mins 3 6
# 4: 2016-04-28 10:00:56 0.14711365 1 TRUE 40.6333333 mins 4 6
# 5: 2016-04-28 10:09:11 0.54406602 1 FALSE 48.8833333 mins 5 7
# 6: 2016-04-28 10:39:09 0.69280341 1 TRUE 78.8500000 mins 6 NA
# 7: 2016-04-28 10:50:01 0.99426978 1 FALSE 89.7166667 mins 7 NA
# 8: 2016-04-28 10:51:46 0.47779597 1 FALSE 91.4666667 mins 8 NA
# 9: 2016-04-28 10:57:58 0.23162579 1 FALSE 97.6666667 mins 9 NA
#10: 2016-04-28 10:58:23 0.96302423 1 FALSE 98.0833333 mins 10 NA
#11: 2016-04-28 09:13:19 0.21640794 2 TRUE 0.0000000 mins 1 4
#12: 2016-04-28 09:13:44 0.70853047 2 FALSE 0.4166667 mins 2 4
#13: 2016-04-28 09:36:44 0.75845954 2 FALSE 23.4166667 mins 3 6
#14: 2016-04-28 09:55:31 0.64050681 2 TRUE 42.2000000 mins 4 8
#15: 2016-04-28 10:00:33 0.90229905 2 FALSE 47.2333333 mins 5 9
#16: 2016-04-28 10:11:51 0.28915974 2 FALSE 58.5333333 mins 6 9
#17: 2016-04-28 10:14:14 0.79546742 2 FALSE 60.9166667 mins 7 9
#18: 2016-04-28 10:26:17 0.69070528 2 TRUE 72.9666667 mins 8 10
#19: 2016-04-28 10:51:02 0.59414202 2 FALSE 97.7166667 mins 9 NA
#20: 2016-04-28 10:56:36 0.65570580 2 TRUE 103.2833333 mins 10 NA
# at this point the problem is "solved", but you still have to extract the solution
# and that's the more complicated part
DT[, keep.new := FALSE]
# iterate over the matching indices (jumping straight to the correct one)
DT[, {
next.idx = 1
while(!is.na(next.idx)) {
set(DT, .I[next.idx], 'keep.new', TRUE)
next.idx = fut.idx[next.idx]
}
}, by = id]
DT[, identical(keep, keep.new)]
#[1] TRUE
或者对于最后一步,你可以这样做(这将遍历整个事情,但我不知道速度影响会是什么):
DT[, keep.3 := FALSE]
DT[DT[, .I[na.omit(Reduce(function(x, y) fut.idx[x], c(1, fut.idx), accumulate = T))]
, by = id]$V1
, keep.3 := TRUE]
DT[, identical(keep, keep.3)]
#[1] TRUE
我有一个 data.table
(约 3000 万行),其中包含一个 POSIXct
格式的 datetime
列、一个 id
列和其他一些列(在例如,我只留下了一个不相关的列 x
来证明还有其他列需要保留)。一个dput
在post的底部。
head(DT)
# datetime x id
#1: 2016-04-28 16:20:18 0.02461368 1
#2: 2016-04-28 16:41:34 0.88953932 1
#3: 2016-04-28 16:46:07 0.31818101 1
#4: 2016-04-28 17:00:56 0.14711365 1
#5: 2016-04-28 17:09:11 0.54406602 1
#6: 2016-04-28 17:39:09 0.69280341 1
问:对于每个 id
,我只需要对时间相差超过 30 分钟的观察结果进行子集化。执行此操作的有效 data.table
方法是什么(如果可能,无需大量循环)?
逻辑也可以描述为(就像我下面的评论):
Per id the first row is always kept. The next row that is at least 30 minutes after the first shall also be kept. Let's assume that row to be kept is row 4. Then, compute time differences between row 4 and rows 5:n and keep the first that differs by more than 30 mins and so on
在下面的 dput 中,我添加了一个列 keep
以指示在此示例中应保留哪些行,因为它们与每个 id 保留的先前观察结果相差超过 30 分钟。困难在于,似乎需要迭代计算时间差(或者至少,我目前想不出更有效的方法)。
library(data.table)
DT <- structure(list(
datetime = structure(c(1461853218.81561, 1461854494.81561,
1461854767.81561, 1461855656.81561, 1461856151.81561, 1461857949.81561,
1461858601.81561, 1461858706.81561, 1461859078.81561, 1461859103.81561,
1461852799.81561, 1461852824.81561, 1461854204.81561, 1461855331.81561,
1461855633.81561, 1461856311.81561, 1461856454.81561, 1461857177.81561,
1461858662.81561, 1461858996.81561), class = c("POSIXct", "POSIXt")),
x = c(0.0246136845089495, 0.889539316063747, 0.318181007634848,
0.147113647311926, 0.544066024711356, 0.6928034061566, 0.994269776623696,
0.477795971091837, 0.231625785352662, 0.963024232536554, 0.216407935833558,
0.708530468167737, 0.758459537522867, 0.640506813768297, 0.902299045119435,
0.28915973729454, 0.795467417687178, 0.690705278422683, 0.59414202044718,
0.655705799115822),
id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
keep = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)),
.Names = c("datetime", "x", "id", "keep"),
row.names = c(NA, -20L),
class = c("data.table", "data.frame"))
setkey(DT, id, datetime)
DT[, difftime := difftime(datetime, shift(datetime, 1L, NA,type="lag"), units = "mins"),
by = id]
DT[is.na(difftime), difftime := 0]
DT[, difftime := cumsum(as.numeric(difftime)), by = id]
keep
列的解释:
- 第 2:3 行与第 1 行相差不到 30 分钟 -> 删除
- 第 4 行与第 1 行相差超过 30 分钟 -> 保持
- 第 5 行距离第 4 行不到 30 分钟 -> 删除
- 第 6 行与第 4 行相差超过 30 分钟 -> 保持
- ...
期望的输出:
desiredDT <- DT[(keep)]
感谢三位专家的解答。我在 1 和 1000 万行数据上测试了它们。这是基准测试的摘录。
a) 100 万行
microbenchmark(frank(DT_Frank), roland(DT_Roland), eddi1(DT_Eddi1), eddi2(DT_Eddi2),
times = 3L, unit = "relative")
#Unit: relative
# expr min lq mean median uq max neval
# frank(DT_Frank) 1.286647 1.277104 1.185216 1.267769 1.140614 1.036749 3
# roland(DT_Roland) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 3
# eddi1(DT_Eddi1) 11.748622 11.697409 10.941792 11.647320 10.587002 9.720901 3
# eddi2(DT_Eddi2) 9.966078 9.915651 9.210168 9.866330 8.877769 8.070281 3
b) 1000 万行
microbenchmark(frank(DT_Frank), roland(DT_Roland), eddi1(DT_Eddi1), eddi2(DT_Eddi2),
times = 3L, unit = "relative")
#Unit: relative
# expr min lq mean median uq max neval
# frank(DT_Frank) 1.019561 1.025427 1.026681 1.031061 1.030028 1.029037 3
# roland(DT_Roland) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 3
# eddi1(DT_Eddi1) 11.567302 11.443146 11.301487 11.323914 11.176515 11.035143 3
# eddi2(DT_Eddi2) 9.796800 9.693823 9.526193 9.594931 9.398969 9.211019 3
显然,@Frank 的 data.table 方法和@Roland 基于 Rcpp 的解决方案在性能上相似,Rcpp 略有优势,而 @eddi 的方法仍然很快,但性能不如其他方法。
然而,当我检查解决方案是否相等时,我发现@Roland 的方法与其他方法的结果略有不同:
a) 100 万行
all.equal(frank(DT_Frank), roland(DT_Roland))
#[1] "Component “datetime”: Numeric: lengths (982228, 982224) differ"
#[2] "Component “id”: Numeric: lengths (982228, 982224) differ"
#[3] "Component “x”: Numeric: lengths (982228, 982224) differ"
all.equal(frank(DT_Frank), eddi1(DT_Eddi1))
#[1] TRUE
all.equal(frank(DT_Frank), eddi2(DT_Eddi2))
#[1] TRUE
b) 1000 万行
all.equal(frank(DT_Frank), roland(DT_Roland))
#[1] "Component “datetime”: Numeric: lengths (9981898, 9981891) differ"
#[2] "Component “id”: Numeric: lengths (9981898, 9981891) differ"
#[3] "Component “x”: Numeric: lengths (9981898, 9981891) differ"
all.equal(frank(DT_Frank), eddi1(DT_Eddi1))
#[1] TRUE
all.equal(frank(DT_Frank), eddi2(DT_Eddi2))
#[1] TRUE
我目前的假设是,这种差异可能与差异是 > 30 分钟还是 >= 30 分钟有关,尽管我还不确定。
最后的想法:我决定采用@Frank 的解决方案有两个原因:1. 它的性能非常好,几乎与 Rcpp 解决方案相当,并且 2. 它不需要我不使用的另一个包还很熟悉(反正我用的是data.table)
我会这样做:
setDT(DT, key=c("id","datetime")) # invalid selfref with the OP's example data
s = 0L
w = DT[, .I[1L], by=id]$V1
while (length(w)){
s = s + 1L
DT[w, tag := s]
m = DT[w, .(id, datetime = datetime+30*60)]
w = DT[m, which = TRUE, roll=-Inf]
w = w[!is.na(w)]
}
这给出了
datetime x id keep tag
1: 2016-04-28 10:20:18 0.02461368 1 TRUE 1
2: 2016-04-28 10:41:34 0.88953932 1 FALSE NA
3: 2016-04-28 10:46:07 0.31818101 1 FALSE NA
4: 2016-04-28 11:00:56 0.14711365 1 TRUE 2
5: 2016-04-28 11:09:11 0.54406602 1 FALSE NA
6: 2016-04-28 11:39:09 0.69280341 1 TRUE 3
7: 2016-04-28 11:50:01 0.99426978 1 FALSE NA
8: 2016-04-28 11:51:46 0.47779597 1 FALSE NA
9: 2016-04-28 11:57:58 0.23162579 1 FALSE NA
10: 2016-04-28 11:58:23 0.96302423 1 FALSE NA
11: 2016-04-28 10:13:19 0.21640794 2 TRUE 1
12: 2016-04-28 10:13:44 0.70853047 2 FALSE NA
13: 2016-04-28 10:36:44 0.75845954 2 FALSE NA
14: 2016-04-28 10:55:31 0.64050681 2 TRUE 2
15: 2016-04-28 11:00:33 0.90229905 2 FALSE NA
16: 2016-04-28 11:11:51 0.28915974 2 FALSE NA
17: 2016-04-28 11:14:14 0.79546742 2 FALSE NA
18: 2016-04-28 11:26:17 0.69070528 2 TRUE 3
19: 2016-04-28 11:51:02 0.59414202 2 FALSE NA
20: 2016-04-28 11:56:36 0.65570580 2 TRUE 4
其背后的想法由 OP
per id the first row is always kept. The next row that is at least 30 minutes after the first shall also be kept. Let's assume that row to be kept is row 4. Then, compute time differences between row 4 and rows 5:n and keep the first that differs by more than 30 mins and so on
使用 Rcpp:
library(Rcpp)
library(inline)
cppFunction(
'LogicalVector selecttimes(const NumericVector x) {
const int n = x.length();
LogicalVector res(n);
res(0) = true;
double testval = x(0);
for (int i=1; i<n; i++) {
if (x(i) - testval > 30 * 60) {
testval = x(i);
res(i) = true;
}
}
return res;
}')
DT[, keep1 := selecttimes(datetime), by = id]
DT[, all(keep == keep1)]
#[1] TRUE
需要做一些额外的测试,需要输入验证,时间差可以作为参数。
# create an index column
DT[, idx := 1:.N, by = id]
# find the indices of the matching future dates
DT[, fut.idx := DT[.(id = id, datetime = datetime+30*60), on = c('id', 'datetime')
, idx, roll = -Inf]]
# datetime x id keep difftime idx fut.idx
# 1: 2016-04-28 09:20:18 0.02461368 1 TRUE 0.0000000 mins 1 4
# 2: 2016-04-28 09:41:34 0.88953932 1 FALSE 21.2666667 mins 2 6
# 3: 2016-04-28 09:46:07 0.31818101 1 FALSE 25.8166667 mins 3 6
# 4: 2016-04-28 10:00:56 0.14711365 1 TRUE 40.6333333 mins 4 6
# 5: 2016-04-28 10:09:11 0.54406602 1 FALSE 48.8833333 mins 5 7
# 6: 2016-04-28 10:39:09 0.69280341 1 TRUE 78.8500000 mins 6 NA
# 7: 2016-04-28 10:50:01 0.99426978 1 FALSE 89.7166667 mins 7 NA
# 8: 2016-04-28 10:51:46 0.47779597 1 FALSE 91.4666667 mins 8 NA
# 9: 2016-04-28 10:57:58 0.23162579 1 FALSE 97.6666667 mins 9 NA
#10: 2016-04-28 10:58:23 0.96302423 1 FALSE 98.0833333 mins 10 NA
#11: 2016-04-28 09:13:19 0.21640794 2 TRUE 0.0000000 mins 1 4
#12: 2016-04-28 09:13:44 0.70853047 2 FALSE 0.4166667 mins 2 4
#13: 2016-04-28 09:36:44 0.75845954 2 FALSE 23.4166667 mins 3 6
#14: 2016-04-28 09:55:31 0.64050681 2 TRUE 42.2000000 mins 4 8
#15: 2016-04-28 10:00:33 0.90229905 2 FALSE 47.2333333 mins 5 9
#16: 2016-04-28 10:11:51 0.28915974 2 FALSE 58.5333333 mins 6 9
#17: 2016-04-28 10:14:14 0.79546742 2 FALSE 60.9166667 mins 7 9
#18: 2016-04-28 10:26:17 0.69070528 2 TRUE 72.9666667 mins 8 10
#19: 2016-04-28 10:51:02 0.59414202 2 FALSE 97.7166667 mins 9 NA
#20: 2016-04-28 10:56:36 0.65570580 2 TRUE 103.2833333 mins 10 NA
# at this point the problem is "solved", but you still have to extract the solution
# and that's the more complicated part
DT[, keep.new := FALSE]
# iterate over the matching indices (jumping straight to the correct one)
DT[, {
next.idx = 1
while(!is.na(next.idx)) {
set(DT, .I[next.idx], 'keep.new', TRUE)
next.idx = fut.idx[next.idx]
}
}, by = id]
DT[, identical(keep, keep.new)]
#[1] TRUE
或者对于最后一步,你可以这样做(这将遍历整个事情,但我不知道速度影响会是什么):
DT[, keep.3 := FALSE]
DT[DT[, .I[na.omit(Reduce(function(x, y) fut.idx[x], c(1, fut.idx), accumulate = T))]
, by = id]$V1
, keep.3 := TRUE]
DT[, identical(keep, keep.3)]
#[1] TRUE