R data.table: Setting the remainder of column values to the next column value if exceeding a certain threshold, for a large data set
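I am working on a simple peak-shaving algorithm and am looking for the most optimized way to move the remainder of a column's values to the next column when a value exceeds a certain threshold, for a large time series.
Given the following sample data set, where each column has its own threshold, the goal is a data.table in which each value is capped at its threshold and the excess is added to the next column's value (again without exceeding that column's threshold), and so on up to a certain window limit.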
loads <- data.table(index = 1:3,
time1 = c(6600,3000, 12000),
time2 = c(12000, 4000, 2000),
time3 = c(0, 0, 0),
time4 = c(3000,12000,0),
time5 = c(5000, 2000, 3000),
time6 = c(0, 0, 0),
time7 = c(15000, 0, 0))
thresholds <- c("time1" = 5000,
"time2" = 5000,
"time3" = 5000,
"time4" = 12000,
"time5" = 12000,
"time6" = 12000,
"time7" = 5000)
With a window of 7 columns, this should yield the following data.table:
res <- data.table(index = 1:3,
time1 = c(5000, 3000, 5000),
time2 = c(5000, 4000, 5000),
time3 = c(5000, 0, 4000),
time4 = c(6600, 12000, 0),
time5 = c(5000, 2000, 3000),
time6 = c(0, 0, 0),
time7 = c(5000, 0, 0))
I know there are some obvious ways to do this row by row, but I am looking for a more vectorized/data.table way of doing it.
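For reference, the row-wise approach mentioned above might look roughly like the base R sketch below. The helper name cap_row and the matrix layout are assumptions made for illustration only; they do not come from the question. The sketch walks each row left to right, caps every value at its threshold, and carries the excess into the next column. Whatever is left over after the last column is dropped.

```r
# Row-wise reference implementation (base R only); slow on large data,
# but easy to check by hand. `cap_row` is a hypothetical helper name.
cap_row <- function(x, limits) {
  carry <- 0
  for (k in seq_along(x)) {
    total <- x[k] + carry           # load plus overflow from the previous column
    x[k]  <- min(total, limits[k])  # cap at this column's threshold
    carry <- total - x[k]           # overflow passed on to the next column
  }
  x                                 # overflow after the last column is dropped
}

# The question's sample data as a plain matrix (rows = index, columns = time1..time7)
loads_m <- rbind(
  c(6600, 12000, 0, 3000, 5000, 0, 15000),
  c(3000, 4000, 0, 12000, 2000, 0, 0),
  c(12000, 2000, 0, 0, 3000, 0, 0)
)
limits <- c(5000, 5000, 5000, 12000, 12000, 12000, 5000)

# apply() returns one column per input row, so transpose back
capped <- t(apply(loads_m, 1, cap_row, limits = limits))
capped
```

Because it runs an R-level loop for every row, this is mainly useful as a baseline for validating a faster column-wise version.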
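I don't think this is easy (or perhaps even possible?) with "just" vectorized/data.table-canonical code, but here is a straightforward for loop that is (I think) reasonably data.table-efficient.
Up front: I'm adding timeX to both thresholds (with a limit of Inf) and loads (with a value of 0) as a catch-all column, so that we know how much of each row's remainder has been "lost". It is also handy to have for the for loop (though the loop can do without it if the code is rewritten).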
library(data.table)
thresholds <- c("time1" = 5000,
"time2" = 5000,
"time3" = 5000,
"time4" = 12000,
"time5" = 12000,
"time6" = 12000,
"time7" = 5000,
"timeX" = Inf)
loads[, timeX := 0 ]
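for (ind in seq_along(thresholds)) {
  # the last entry (timeX) only receives overflow, so stop before processing it
  if (ind >= length(thresholds)) break
  nm <- names(thresholds)[ind]
  nm1 <- names(thresholds)[ind+1]
  # amount by which each row exceeds this column's threshold
  rmndr <- pmax(0, loads[[nm]] - thresholds[ind])
  # cap this column by reference, then carry the excess into the next column
  set(loads, i = NULL, j = nm, value = pmin(loads[[nm]], thresholds[ind]))
  set(loads, i = NULL, j = nm1, value = loads[[nm1]] + rmndr)
}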
loads
#    index time1 time2 time3 time4 time5 time6 time7 timeX
#    <int> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:     1  5000  5000  5000  6600  5000     0  5000 10000
# 2:     2  3000  4000     0 12000  2000     0     0     0
# 3:     3  5000  5000  4000     0  3000     0     0     0
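One way to see what the catch-all timeX column buys you is to check that no load disappears: each row's total across time1 to timeX should be the same before and after the loop. The sketch below rebuilds the sample data and repeats the same capping logic so that it runs on its own. The names cols, before, and after are illustrative and not part of the answer.

```r
library(data.table)

loads <- data.table(index = 1:3,
                    time1 = c(6600, 3000, 12000),
                    time2 = c(12000, 4000, 2000),
                    time3 = c(0, 0, 0),
                    time4 = c(3000, 12000, 0),
                    time5 = c(5000, 2000, 3000),
                    time6 = c(0, 0, 0),
                    time7 = c(15000, 0, 0))
thresholds <- c(time1 = 5000, time2 = 5000, time3 = 5000, time4 = 12000,
                time5 = 12000, time6 = 12000, time7 = 5000, timeX = Inf)

cols <- names(thresholds)
loads[, timeX := 0]                                   # catch-all column
before <- loads[, rowSums(.SD), .SDcols = cols]       # row totals before shaving

# Same logic as the loop above: cap each column, push the excess to the next one
for (ind in seq_len(length(cols) - 1L)) {
  nm  <- cols[ind]
  nm1 <- cols[ind + 1L]
  rmndr <- pmax(0, loads[[nm]] - thresholds[[nm]])
  set(loads, j = nm,  value = pmin(loads[[nm]], thresholds[[nm]]))
  set(loads, j = nm1, value = loads[[nm1]] + rmndr)
}

after <- loads[, rowSums(.SD), .SDcols = cols]        # row totals after shaving
all.equal(before, after)   # totals should match once timeX is counted
loads$timeX                # overflow that did not fit inside the window
```

If the totals ever differ, load is being lost somewhere other than the last column, which usually points to a mismatch between the names in thresholds and the columns of loads.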
Or, if you truly don't care about the discarded amounts, then:
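## using unmodified `loads` and `thresholds` (re-create `loads` first if the
## loop above has already run, since set() modifies it by reference)
for (ind in seq_along(thresholds)) {
  nm <- names(thresholds)[ind]
  rmndr <- pmax(0, loads[[nm]] - thresholds[nm])
  # cap the current column at its threshold
  set(loads, i = NULL, j = nm, value = pmin(loads[[nm]], thresholds[nm]))
  # after capping the last column, its overflow is simply discarded
  if (ind == length(thresholds)) break
  nm1 <- names(thresholds)[ind+1]
  set(loads, i = NULL, j = nm1, value = loads[[nm1]] + rmndr)
}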
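Since set() changes loads in place, it can be convenient to wrap the loop in a function that works on a copy. The following is only a sketch: the function name shave_peaks and its signature are hypothetical, and it keeps the overflow in a separate carry vector instead of writing it into the next column first. The capping itself is the same as in the loops above. It assumes every name in thresholds is a column of the input table.

```r
library(data.table)

# Hypothetical wrapper; name and signature are illustrative, not from the answer.
shave_peaks <- function(dt, thresholds) {
  out   <- copy(dt)                 # leave the caller's table untouched
  carry <- numeric(nrow(out))       # overflow carried into the current column
  for (nm in names(thresholds)) {
    total  <- out[[nm]] + carry
    capped <- pmin(total, thresholds[[nm]])
    carry  <- total - capped        # whatever is left after the last column is dropped
    set(out, j = nm, value = capped)
  }
  out[]
}

loads <- data.table(index = 1:3,
                    time1 = c(6600, 3000, 12000),
                    time2 = c(12000, 4000, 2000),
                    time3 = c(0, 0, 0),
                    time4 = c(3000, 12000, 0),
                    time5 = c(5000, 2000, 3000),
                    time6 = c(0, 0, 0),
                    time7 = c(15000, 0, 0))
thresholds <- c(time1 = 5000, time2 = 5000, time3 = 5000, time4 = 12000,
                time5 = 12000, time6 = 12000, time7 = 5000)
res <- data.table(index = 1:3,
                  time1 = c(5000, 3000, 5000),
                  time2 = c(5000, 4000, 5000),
                  time3 = c(5000, 0, 4000),
                  time4 = c(6600, 12000, 0),
                  time5 = c(5000, 2000, 3000),
                  time6 = c(0, 0, 0),
                  time7 = c(5000, 0, 0))

capped <- shave_peaks(loads, thresholds)
all.equal(capped, res)   # compare against the expected table from the question
```

The loop still runs once per column rather than once per row, so the work inside each iteration stays vectorized over all rows.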