Rolling value based on an inequality *rolling* condition of a column in data.table (R)
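I have a large data.table (about 50 million rows) with several ID columns (ID1, ID2) that group the rows. I want to shift the Value column based on an inequality rolling condition, which I outline below. What is a rolling condition? I made the term up: it means the condition itself changes (rolls) from row to row.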
> require(data.table)
> DT = data.table(ID1 = c(rep(1,7), rep(2,4)), ID2 = c('A','B','B','C','C','A','A','D','D','E','D'), Value = (1:11))
Desired output: for each row, return the Value of the most recent earlier row that has the same ID1 but a different ID2 (so the grouping is by ID1).
> DT
ID1 ID2 Value desired_output
1: 1 A 1 NA -- no previous row with the same ID1 and different ID2
2: 1 B 2 1 -- last row with same ID1 and different ID2 is row 1, so desired_output is the Value of row 1
3: 1 B 3 1 -- last row with same ID1 and different ID2 is row 1, so desired_output is the Value of row 1
4: 1 C 4 3 -- last row with same ID1 and different ID2 is row 3, so desired_output is the Value of row 3
5: 1 C 5 3 -- last row with same ID1 and different ID2 is row 3, so desired_output is the Value of row 3
6: 1 A 6 5 -- last row with same ID1 and different ID2 is row 5, so desired_output is the Value of row 5
7: 1 A 7 5 -- last row with same ID1 and different ID2 is row 5, so desired_output is the Value of row 5
8: 2 D 8 NA -- no previous row with the same ID1 and different ID2
9: 2 D 9 NA -- no previous row with the same ID1 and different ID2
10: 2 E 10 9 -- last row with same ID1 and different ID2 is row 9, so desired_output is the Value of row 9
11: 2 D 11 10 -- last row with same ID1 and different ID2 is row 10, so desired_output is the Value of row 10
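To pin the rule down, here is a deliberately naive reference sketch that reproduces desired_output on the sample data. The helper name `prev_diff_value` is made up for illustration and is not part of data.table. It is quadratic within each ID1 group, so it is only meant for checking faster solutions on small inputs.

```r
library(data.table)

DT <- data.table(
  ID1   = c(rep(1, 7), rep(2, 4)),
  ID2   = c("A", "B", "B", "C", "C", "A", "A", "D", "D", "E", "D"),
  Value = 1:11
)

# Hypothetical helper: for each position, find the nearest earlier position
# whose id2 differs from the current one and return its value (NA if none).
prev_diff_value <- function(id2, value) {
  out <- rep(value[NA_integer_], length(id2))   # NA vector of the same type as value
  for (i in seq_along(id2)) {
    j <- which(id2[seq_len(i - 1L)] != id2[i])  # earlier rows with a different id2
    if (length(j)) out[i] <- value[max(j)]      # keep the most recent one
  }
  out
}

# Apply the rule separately within each ID1 group
DT[, desired_output := prev_diff_value(ID2, Value), by = ID1]
```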
I am looking for a computationally efficient way to do this.
Here is one approach:
DT[, rleid_id2 := rleid(ID2), by = .(ID1)]
DT[DT, on = .(ID1, rleid_id2 > rleid_id2), val := i.Value]
> DT
ID1 ID2 Value rleid_id2 val
1: 1 A 1 1 NA
2: 1 B 2 2 1
3: 1 B 3 2 1
4: 1 C 4 3 3
5: 1 C 5 3 3
6: 1 A 6 4 5
7: 1 A 7 4 5
8: 2 D 8 1 NA
9: 2 D 9 1 NA
10: 2 E 10 2 9
11: 2 D 11 3 10
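Using rleid, we turn the sequence of ID2 values into a numbered id that increments every time ID2 changes, computed per ID1 group. We then join DT to itself with a non-equi condition on rleid_id2 (plus equality on ID1), so that each row picks up a Value from a row with the same ID1 but a lower rleid_id2. When several rows qualify, the update keeps the last match, which is why row 4 gets 3 rather than 1 or 2 in the output above.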
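Because every row with a lower run id matches in the non-equi join, the number of matched pairs can grow quickly when an ID1 group contains many runs. A possible alternative, sketched below, summarises each run to its last Value, shifts that summary by one run within ID1, and joins it back with a plain equi join. This is a sketch under assumptions, not a benchmarked solution: it assumes the rows are already in the intended order, and the names `run`, `runs`, `last_val`, `prev_val` and `val2` are illustrative only.

```r
library(data.table)

DT <- data.table(
  ID1   = c(rep(1, 7), rep(2, 4)),
  ID2   = c("A", "B", "B", "C", "C", "A", "A", "D", "D", "E", "D"),
  Value = 1:11
)

# Global run id: increments whenever ID1 or ID2 changes between consecutive rows
DT[, run := rleid(ID1, ID2)]

# One row per run: the ID1 it belongs to and the last Value seen in that run
runs <- DT[, .(ID1 = ID1[1L], last_val = Value[.N]), by = run]

# Within each ID1, the previous run's last Value (NA for the first run)
runs[, prev_val := shift(last_val), by = ID1]

# Equi join back on the run id; every row of a run gets the same prev_val
DT[runs, on = "run", val2 := i.prev_val]
```

On the sample data, val2 matches the val column produced by the non-equi join. The design choice here is that the intermediate table has one row per run instead of one match per qualifying pair of rows.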