R data.table: setting the remainder of column values to the next column's value when a threshold is exceeded, for a large data set

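I am working on a simple peak-shaving algorithm and am looking for the most efficient way to move the part of a value that exceeds a given threshold into the next column, for a large time series.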

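Consider the following example data set, where each column has its own threshold. The goal is a data.table in which every value is capped at its threshold and the remainder is added to the value in the next column (which in turn must not exceed its own threshold), and so on, up to some window limit.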

library(data.table)

loads <- data.table(index = 1:3,
                    time1 = c(6600, 3000, 12000),
                    time2 = c(12000, 4000, 2000),
                    time3 = c(0, 0, 0),
                    time4 = c(3000, 12000, 0),
                    time5 = c(5000, 2000, 3000),
                    time6 = c(0, 0, 0),
                    time7 = c(15000, 0, 0))

thresholds <- c("time1" = 5000, 
                "time2" = 5000,
                "time3" = 5000,
                "time4" = 12000,
                "time5" = 12000,
                "time6" = 12000,
                "time7" = 5000)

For a window of 7 columns, this should produce the following data.table:

res <- data.table(index = 1:3,
                  time1 = c(5000, 3000, 5000),
                  time2 = c(5000, 4000, 5000),
                  time3 = c(5000, 0, 4000),
                  time4 = c(6600, 12000, 0),
                  time5 = c(5000, 2000, 3000),
                  time6 = c(0, 0, 0),
                  time7 = c(5000, 0, 0))
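To make the carry-over rule concrete, here is a minimal sketch that traces row 1 of the example by hand with plain vectors. The names row1, caps, carry and shaved are illustrative only and are not part of the question's code.

```r
# Trace the carry-over for row 1 of the example using plain vectors.
row1 <- c(time1 = 6600, time2 = 12000, time3 = 0, time4 = 3000,
          time5 = 5000, time6 = 0, time7 = 15000)
caps <- c(time1 = 5000, time2 = 5000, time3 = 5000, time4 = 12000,
          time5 = 12000, time6 = 12000, time7 = 5000)

carry <- 0                              # excess carried in from the previous column
shaved <- setNames(numeric(length(row1)), names(row1))
for (k in seq_along(row1)) {
  total <- row1[[k]] + carry            # own load plus what was pushed forward
  shaved[k] <- min(total, caps[[k]])    # cap at this column's threshold
  carry <- total - shaved[[k]]          # whatever is left moves on
}
shaved
# time1 time2 time3 time4 time5 time6 time7
#  5000  5000  5000  6600  5000     0  5000
carry   # 10000: excess that falls off the end of the 7-column window
```

The excess from time1 and time2 fills time3 up to its cap and the rest lands in time4, which matches the first row of res. The 10000 left over after time7 has no column to go to.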

I know there are obvious ways to do this row by row, but I am looking for a more vectorized/data.table way to do it.

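I don't think this is easy (or perhaps even possible) with "just" vectorized/data.table-canonical code, but here is a straightforward for loop over the columns that is (I think) reasonably data.table-efficient.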

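Up front: I'm adding timeX to both thresholds (with an Inf limit) and loads (with a value of 0) as a catch-all column, so that we know how much of each row's remainder has been "lost". It is also convenient for the for loop (though the loop can be written without it with little rewriting).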

library(data.table)
thresholds <- c("time1" = 5000, 
                "time2" = 5000,
                "time3" = 5000,
                "time4" = 12000,
                "time5" = 12000,
                "time6" = 12000,
                "time7" = 5000,
                "timeX" = Inf)
loads[, timeX := 0]   # catch-all column for excess that leaves time7

for (ind in seq_along(thresholds)) {
  if (ind >= length(thresholds)) break   # timeX has no next column
  nm <- names(thresholds)[ind]           # current column
  nm1 <- names(thresholds)[ind+1]        # column that receives the excess
  rmndr <- pmax(0, loads[[nm]] - thresholds[ind])   # excess over the cap, per row
  set(loads, i = NULL, j = nm, value = pmin(loads[[nm]], thresholds[ind]))
  set(loads, i = NULL, j = nm1, value = loads[[nm1]] + rmndr)
}
loads
#    index time1 time2 time3 time4 time5 time6 time7 timeX
#    <int> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:     1  5000  5000  5000  6600  5000     0  5000 10000
# 2:     2  3000  4000     0 12000  2000     0     0     0
# 3:     3  5000  5000  4000     0  3000     0     0     0
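One way to sanity-check the loop above: because timeX catches everything that leaves time7, each row's total should be the same before and after shaving. The sketch below rebuilds the example data so it runs on its own; the names cols, before and after are illustrative.

```r
library(data.table)

loads <- data.table(index = 1:3,
                    time1 = c(6600, 3000, 12000),
                    time2 = c(12000, 4000, 2000),
                    time3 = c(0, 0, 0),
                    time4 = c(3000, 12000, 0),
                    time5 = c(5000, 2000, 3000),
                    time6 = c(0, 0, 0),
                    time7 = c(15000, 0, 0),
                    timeX = 0)   # catch-all column
thresholds <- c(time1 = 5000, time2 = 5000, time3 = 5000,
                time4 = 12000, time5 = 12000, time6 = 12000,
                time7 = 5000, timeX = Inf)

cols <- names(thresholds)
before <- loads[, rowSums(.SD), .SDcols = cols]   # row totals before shaving

# Same idea as the loop above: cap each column, push the excess to the next.
for (ind in seq_len(length(cols) - 1L)) {
  nm <- cols[ind]
  nm1 <- cols[ind + 1L]
  rmndr <- pmax(0, loads[[nm]] - thresholds[[nm]])
  set(loads, j = nm, value = pmin(loads[[nm]], thresholds[[nm]]))
  set(loads, j = nm1, value = loads[[nm1]] + rmndr)
}

after <- loads[, rowSums(.SD), .SDcols = cols]    # row totals after shaving
all.equal(before, after)   # TRUE: nothing was lost, only moved
loads$timeX                # 10000 0 0: excess that left the window, per row
```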

Or, if you really don't care about the discarded amounts, then

## using unmodified `loads` and `thresholds` (no `timeX` column or entry)
for (ind in seq_along(thresholds)) {
  nm <- names(thresholds)[ind]
  rmndr <- pmax(0, loads[[nm]] - thresholds[nm])   # excess over the cap, per row
  set(loads, i = NULL, j = nm, value = pmin(loads[[nm]], thresholds[nm]))
  if (ind == length(thresholds)) break   # last column: its excess is dropped
  nm1 <- names(thresholds)[ind+1]
  set(loads, i = NULL, j = nm1, value = loads[[nm1]] + rmndr)
}
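If modifying loads by reference is not wanted, the same column-wise loop can be wrapped in a function that works on a copy. This is only a sketch: shave_peaks is a hypothetical helper name, not a data.table function, and it assumes the names of thresholds are columns of the table in left-to-right order.

```r
library(data.table)

# Hypothetical helper: caps each column named in `thresholds` and carries the
# excess into the next column. Excess leaving the last column is discarded.
shave_peaks <- function(dt, thresholds) {
  out <- copy(dt)                        # leave the caller's table untouched
  cols <- names(thresholds)
  stopifnot(all(cols %in% names(out)))
  carry <- numeric(nrow(out))            # excess carried in, one value per row
  for (nm in cols) {
    total <- out[[nm]] + carry
    capped <- pmin(total, thresholds[[nm]])
    carry <- total - capped
    set(out, j = nm, value = capped)
  }
  out
}

loads <- data.table(index = 1:3,
                    time1 = c(6600, 3000, 12000),
                    time2 = c(12000, 4000, 2000),
                    time3 = c(0, 0, 0),
                    time4 = c(3000, 12000, 0),
                    time5 = c(5000, 2000, 3000),
                    time6 = c(0, 0, 0),
                    time7 = c(15000, 0, 0))
thresholds <- c(time1 = 5000, time2 = 5000, time3 = 5000,
                time4 = 12000, time5 = 12000, time6 = 12000,
                time7 = 5000)

shaved <- shave_peaks(loads, thresholds)
shaved$time4   # 6600 12000 0, as in the expected result from the question
loads$time1    # still 6600 3000 12000: the input was not modified
```

Passing only the first few entries of thresholds would restrict the carry to that window, since the function loops over names(thresholds) only. The work is still one vectorized pass per column, so the cost grows with the number of columns rather than with a per-row loop.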