Find a cumulative sum of one column until a conditional sum on another column is met
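For each row, I want to find the preceding cumsum of column B (i.e. the cumsum minus the current row), taken over the preceding rows for which the sum of column A (including the current row) is <= 7.
I was able to get the answer with a traditional for loop. A vectorized implementation would be very helpful, since I need to run it on a large data set. I am sharing my simple code in case it helps.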
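dt <- data.frame(A = c(0, 2, 3, 5, 8, 90, 8, 2, 4, 1, 2),
                 B = c(1, 0, 4, 2, 3, 4, 2, 1, 2, 3, 1),
                 Ans = c(0, 1, 1, 4, 0, 0, 0, 2, 3, 5, 6),
                 new = rep(0, 11))
dt3 <- dt
for (i in 2:nrow(dt3)) {
  set <- 0    # running sum of B over the preceding rows
  count <- 0  # running sum of A, starting at the current row
  k <- i - 1
  for (j in k:1) {
    count <- count + dt3$A[j + 1]
    if (count <= 7) {
      set <- set + dt3$B[j]
      # reached the first row without exceeding the limit
      if (j == 1) {
        dt3$new[i] <- set
      }
    } else {
      # limit exceeded: keep the sum collected so far
      dt3$new[i] <- set
    }
  }
}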
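The following 3 conditions must be met:
- If A > 7, Ans resets to 0
- If cumsum(A) <= 7, Ans is the cumsum() of lagged B
- If cumsum(A) > 7, Ans is the cumsum() of lagged B over the range of preceding rows of A whose sum is <= 7
Here is a simplified version of the data (columns A and B); the desired output is column Ans: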
dt <- data.frame(A = c(0, 2, 3, 5, 8, 90, 8, 2, 4, 1, 2),
                 B = c(1, 0, 4, 2, 3, 4, 2, 1, 2, 3, 1),
                 Ans = c(0, 1, 1, 4, 0, 0, 0, 2, 3, 5, 6))
dt
    A B Ans  Reason for value in Ans:
1   0 1   0  There are no preceding rows in B, so Ans is 0
2   2 0   1  Sum of value of A from row 2 to 1 is 2 <= 7. So Ans is the value of B from first row = 1
3   3 4   1  Sum of value of A from row 3, 2 and 1 is 5 <= 7. So Ans is the sum of value of B in row 1 and 2, which is 1
4   5 2   4  Value of A from row 4 is 5, which is <= 7. So Ans is value of B from row 3, which is 4
5   8 3   0  Value of A in row 5 is 8, which is > 7. So Ans is 0 (value of Ans resets to 0 when A > 7)
6  90 4   0
7   8 2   0
8   2 1   2  Value of A in row 8 is 2, which is <= 7, so Ans is value of B in row 7, which is 2
9   4 2   3  Sum of value of A from row 9 and 8 is 6 <= 7, so Ans is sum of value of B in row 8 and 7 = 3
10  1 3   5  Sum of value of A from row 10, 9 and 8 is 7 <= 7, so Ans is sum of value of B in row 9, 8 and 7 = 5
11  2 1   6  Sum of value of A from row 11, 10 and 9 is 7 <= 7, so Ans is sum of value of B in row 10, 9 and 8 = 6
Any help on how to code this in R?
Please see the edit below, which tries to answer the updated question.
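If I have understood the OP's intention correctly, there are 3 rules:
- If A is greater than 7, Ans is zero and the grouping starts over.
- If cumsum(A) within a group is less than or equal to 7, Ans is the cumsum() of lagged B.
- If cumsum(A) within a group is greater than 7, Ans is lagged B.
The code below produces the expected result for the given sample data set: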
# create sample data set
DF <- data.frame(A = c(0, 2, 3, 5, 8, 90, 8, 2, 4, 1),
                 B = c(1, 0, 4, 2, 3, 4, 2, 1, 2, 3),
                 Ans = c(0, 1, 1, 4, 0, 0, 0, 2, 3, 5))
# load data.table, CRAN version 1.10.4 used
library(data.table)
# coerce to data.table
DT <- data.table(DF)
# create helper column with lagged values of B
DT[, lagB := shift(B, fill = 0)][]
# create new answer
DT[, new := (A <= 7) * ifelse(cumsum(A) <= 7, cumsum(lagB), lagB), by = rleid(A <= 7)][
  , lagB := NULL][]
     A B Ans new
 1:  0 1   0   0
 2:  2 0   1   1
 3:  3 4   1   1
 4:  5 2   4   4
 5:  8 3   0   0
 6: 90 4   0   0
 7:  8 2   0   0
 8:  2 1   2   2
 9:  4 2   3   3
10:  1 3   5   5
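rleid(A <= 7) creates a unique group number for each streak of consecutive A values that are all either greater than 7 or not greater than 7. The ifelse() clause implements rules 2 and 3 within the groups. Multiplying the result by (A <= 7) implements rule 1, using the trick that as.numeric(TRUE) is 1 and as.numeric(FALSE) is 0. Finally, the helper column is removed.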
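To make the two tricks described above more concrete, here is a minimal sketch that applies them to the A column of the sample data on their own. The variable names grp and mask are only illustrative and do not appear in the code above; the factor 10 is an arbitrary stand-in for the value being masked.

```r
library(data.table)

A <- c(0, 2, 3, 5, 8, 90, 8, 2, 4, 1)

# each streak of equal values of (A <= 7) gets its own group number
grp <- rleid(A <= 7)
grp
# 1 1 1 1 2 2 2 3 3 3

# a logical vector is coerced to 0/1 in arithmetic,
# so rows with A > 7 are forced to 0
mask <- (A <= 7) * 10
mask
# 10 10 10 10 0 0 0 10 10 10
```

Grouping by grp is what lets cumsum() restart after every streak of rows with A > 7.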
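Edit
With the additional information supplied by the OP, I believe there is only one rule left:
- For each row, find a window reaching backwards that includes as many rows as possible while sum(A) does not exceed 7. Ans is the sum of lagged B within the same window.
- To clarify: if the window has length zero because A in the starting row already exceeds 7, Ans is zero.
The sliding window of variable length is the tricky part here: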
# sample data set consists of 11 rows after OP's edit
DF <- data.frame(A = c(0, 2, 3, 5, 8, 90, 8, 2, 4, 1, 2),
                 B = c(1, 0, 4, 2, 3, 4, 2, 1, 2, 3, 1),
                 Ans = c(0, 1, 1, 4, 0, 0, 0, 2, 3, 5, 6))
DT <- data.table(DF)
DT[, lagB := shift(B, fill = 0)][]
# find window lengths
DT[, wl := DT[, Reduce(`+`, shift(A, 0:6, fill = 0), accumulate = TRUE)][, rn := .I][
  , Position(function(x) x <= 7, right = TRUE, unlist(.SD)), by = rn]$V1][]
# sum lagged B in respective window
DT[, new := DT[, Reduce(`+`, shift(lagB, 0:6, fill = 0), accumulate = TRUE)][
  , rn := .I][, wl := DT$wl][, ifelse(is.na(wl), 0, unlist(.SD)[wl]), by = rn]$V1][]
     A B Ans lagB wl new
 1:  0 1   0    0  7   0
 2:  2 0   1    1  7   1
 3:  3 4   1    0  7   1
 4:  5 2   4    4  1   4
 5:  8 3   0    2 NA   0
 6: 90 4   0    3 NA   0
 7:  8 2   0    4 NA   0
 8:  2 1   2    2  1   2
 9:  4 2   3    1  2   3
10:  1 3   5    2  3   5
11:  2 1   6    3  3   6
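As an alternative way to look at the single rule from the edit, the window can also be sketched with prefix sums in base R. This is not the code used above but a different technique, and it rests on assumptions: every value in A is non-negative (so the prefix sums never decrease), and the installed R supports the left.open argument of findInterval(). The function name window_sum and its limit parameter are made up for illustration.

```r
# Sketch only: assumes A >= 0 everywhere, so cumsum(A) is non-decreasing.
window_sum <- function(A, B, limit = 7) {
  n <- length(A)
  lagB <- c(0, B[-n])        # B shifted down by one row, first value 0
  pA <- c(0, cumsum(A))      # pA[k + 1] = sum(A[1:k])
  pB <- c(0, cumsum(lagB))   # pB[k + 1] = sum(lagB[1:k])
  # For row i, count the prefix sums strictly below sum(A[1:i]) - limit.
  # That count is the number of leading rows to drop from the window.
  n_drop <- findInterval(pA[-1] - limit, pA, left.open = TRUE)
  # sum of lagB over the rows that remain in the window
  pB[-1] - pB[n_drop + 1]
}

A <- c(0, 2, 3, 5, 8, 90, 8, 2, 4, 1, 2)
B <- c(1, 0, 4, 2, 3, 4, 2, 1, 2, 3, 1)
window_sum(A, B)
# 0 1 1 4 0 0 0 2 3 5 6
```

When A in the current row already exceeds the limit, all rows up to and including it are dropped and the result is 0 without a special case. Unlike shift(A, 0:6), this sketch does not cap the window at seven rows.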