Rolling value based on an inequality *rolling* condition on a column in an R data.table

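I have a large data.table (about 50 million rows) with several ID columns (ID1, ID2) that are used to group rows. I want to shift the Value column based on an inequality rolling condition, which I outline below. What is a rolling condition? I just made the term up. It means that the condition itself changes (rolls) from row to row.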

> require(data.table)
> DT = data.table(ID1 = c(rep(1,7), rep(2,4)), ID2 = c('A','B','B','C','C','A','A','D','D','E','D'), Value = (1:11))

Output: for each row, return the Value of the last preceding row that has the same ID1 (so group by = ID1) and an ID2 different from that row's ID2.

> DT
    ID1 ID2 Value desired_output
 1:   1   A     1             NA  -- no previous row with the same ID1 and different ID2
 2:   1   B     2              1  -- last row with same ID1 and different ID2 is row 1, so desired_output is the Value of row 1
 3:   1   B     3              1  -- last row with same ID1 and different ID2 is row 1, so desired_output is the Value of row 1
 4:   1   C     4              3  -- last row with same ID1 and different ID2 is row 3, so desired_output is the Value of row 3
 5:   1   C     5              3  -- last row with same ID1 and different ID2 is row 3, so desired_output is the Value of row 3
 6:   1   A     6              5  -- last row with same ID1 and different ID2 is row 5, so desired_output is the Value of row 5
 7:   1   A     7              5  -- last row with same ID1 and different ID2 is row 5, so desired_output is the Value of row 5
 8:   2   D     8             NA  -- no previous row with the same ID1 and different ID2
 9:   2   D     9             NA  -- no previous row with the same ID1 and different ID2
10:   2   E    10              9  -- last row with same ID1 and different ID2 is row 9, so desired_output is the Value of row 9
11:   2   D    11             10  -- last row with same ID1 and different ID2 is row 10, so desired_output is the Value of row 10
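
To pin the rule down, it can be written as a deliberately naive loop. This is only a sketch for checking results on small inputs: `naive_prev_diff` and the `check` column are hypothetical names introduced here, and the quadratic scan would not be usable on 50 million rows.

```r
library(data.table)

DT <- data.table(
  ID1   = c(rep(1, 7), rep(2, 4)),
  ID2   = c("A", "B", "B", "C", "C", "A", "A", "D", "D", "E", "D"),
  Value = 1:11
)

# Hypothetical helper: O(n^2) reference implementation for one ID1 group.
naive_prev_diff <- function(id2, value) {
  out <- rep(value[NA_integer_], length(value))  # NA of the same type as value
  for (i in seq_along(value)) {
    # earlier positions in this group whose ID2 differs from the current one
    j <- which(id2[seq_len(i - 1L)] != id2[i])
    if (length(j) > 0L) out[i] <- value[max(j)]
  }
  out
}

# Apply the rule separately within each ID1 group.
DT[, check := naive_prev_diff(ID2, Value), by = ID1]
```

On the example data, `check` should reproduce the desired_output column above.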

I am looking for a computationally efficient way to do this.

Here is one approach:

DT[, rleid_id2 := rleid(ID2), by = .(ID1)]
DT[DT, on = .(ID1, rleid_id2 > rleid_id2), val := i.Value]

> DT
    ID1 ID2 Value rleid_id2 val
 1:   1   A     1         1  NA
 2:   1   B     2         2   1
 3:   1   B     3         2   1
 4:   1   C     4         3   3
 5:   1   C     5         3   3
 6:   1   A     6         4   5
 7:   1   A     7         4   5
 8:   2   D     8         1  NA
 9:   2   D     9         1  NA
10:   2   E    10         2   9
11:   2   D    11         3  10

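With rleid we turn the sequence of ID2 values into a numbered id that increments on every change, computed per ID1. We then join DT to itself with a non-equi condition on rleid_id2 (plus equality on ID1), so that each row picks up a Value from rows with the same ID1 but a lower rleid_id2. Several earlier rows can match a given row; the update is applied in row order and the last match wins, which is the last row of the preceding run.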
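
The non-equi join pairs each row with every row from earlier runs of the same ID1, which may get expensive when groups are long. Below is a sketch of a variant that reuses the same rleid idea but replaces the non-equi join with `shift` on a per-run summary. The names `runs`, `last_value`, `prev_value` and `val_shift` are made up for illustration, and this has not been benchmarked at the 50-million-row scale.

```r
library(data.table)

DT <- data.table(
  ID1   = c(rep(1, 7), rep(2, 4)),
  ID2   = c("A", "B", "B", "C", "C", "A", "A", "D", "D", "E", "D"),
  Value = 1:11
)

# Number the runs of ID2 within each ID1, as in the approach above.
DT[, rleid_id2 := rleid(ID2), by = .(ID1)]

# One row per run, keeping the Value of the run's last row.
runs <- DT[, .(last_value = Value[.N]), by = .(ID1, rleid_id2)]

# Within an ID1, the previous run always has a different ID2,
# so its last Value is what every row of the current run needs.
runs[, prev_value := shift(last_value), by = ID1]

# Equi-join the per-run result back onto the original rows.
DT[runs, on = .(ID1, rleid_id2), val_shift := i.prev_value]
```

On the example data, `val_shift` should match the `val` column shown above. The design choice is to do the "look back" once per run rather than once per row, so the join back is a plain equality join on (ID1, rleid_id2).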