How to avoid this for loop in R

Question

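I am trying to get the maximum value of the DT$pna column between peak and trough events, which are flagged in their own columns of the data.table (DT$peak and DT$trough). These columns contain the strings "peak" and "trough" to mark the start and end of each event. The for loop below works on a heavily reduced sample, but because the data.table has millions of rows, it takes forever. Is there a better solution (perhaps using data.table) to get the maximum values more efficiently in this case?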

for (i in 1:nrow(DT)) {
  if(is.na(DT$peak[i])) {
    next
  }
  if(DT$peak[i] == "peak") {
    e <- i + 15000
    for (j in i:e) {
      if(is.na(DT$trough[j])) {
        next
      }
      if(DT$trough[j] == "trough") {
        x <- (DT$pna[i:j])
      }
    }  
  }
  DT[i, max_insp := max(x)]
}

Answer 1

Here is one option:

DT[, rn := .I]

#use rolling join to find the nearest trough
DT[!is.na(peak), nt := DT[!is.na(trough)][.SD, on=.(rn), roll=-Inf, x.rn]]

#use non-equi join to find the max
DT[!is.na(peak), max_insp :=
    DT[.SD, on=.(rn>=rn, rn<=nt), by=.EACHI, max(x.pna)]$V1
]
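
To see what the rolling join in the first step does in isolation, here is a minimal sketch on toy data. The tables `pk` and `tr` and their row numbers are made up for illustration; only the `roll = -Inf` join pattern is taken from the code above.

```r
library(data.table)

# Hypothetical toy data: row numbers of the flagged events
pk <- data.table(rn = c(2L, 14L))   # rows marked "peak"
tr <- data.table(rn = c(11L, 20L))  # rows marked "trough"

# roll = -Inf carries the next observation backward (NOCB), so each
# peak is matched to the first trough at or after its row number.
# x.rn is the rn column of tr, the table being joined into.
pk[, nt := tr[pk, on = .(rn), roll = -Inf, x.rn]]
pk
```

Each peak row now carries the row number of its closing trough in `nt`, which is what the non-equi join then uses as the upper bound of the interval.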

Another option (possibly faster if you have many peaks and troughs, though perhaps less readable):

DT[, c("pix", "tix") := .(nafill(replace(.I, is.na(peak), NA_integer_), "locf"), 
  nafill(replace(.I, is.na(trough), NA_integer_), "nocb"))]

iv <- DT[order(pix, tix, -pna)][{
    ri <- rleid(pix, tix)
    ri!=shift(ri, fill=0L) & !is.na(pix) & !is.na(tix)
  }]

DT[iv$pix, max_insp := iv$pna]
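
The index columns `pix` and `tix` are easier to follow on a small case. The sketch below builds them for a made-up table `toy` with two peak/trough pairs (the table and its flags are assumptions for illustration, not part of the original data).

```r
library(data.table)

# Hypothetical flags: peaks at rows 2 and 6, troughs at rows 4 and 8
toy <- data.table(
  peak   = c(NA, "peak", NA, NA, NA, "peak", NA, NA),
  trough = c(NA, NA, NA, "trough", NA, NA, NA, "trough")
)

# pix: row number of the most recent peak (last observation carried forward)
toy[, pix := nafill(replace(.I, is.na(peak), NA_integer_), "locf")]
# tix: row number of the next trough (next observation carried backward)
toy[, tix := nafill(replace(.I, is.na(trough), NA_integer_), "nocb")]
toy
```

Row 5 lies after the first trough and before the second peak, yet it still receives `pix = 2` and `tix = 8`, so it forms a `(pix, tix)` group of its own. With several peak/trough pairs it may be worth comparing the result of this option against the first one.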

Output:

    peak trough          pna rn nt max_insp
 1: <NA>   <NA>  1.262954285  1 NA       NA
 2: peak   <NA> -0.326233361  2 11 2.404653
 3: <NA>   <NA>  1.329799263  3 NA       NA
 4: <NA>   <NA>  1.272429321  4 NA       NA
 5: <NA>   <NA>  0.414641434  5 NA       NA
 6: <NA>   <NA> -1.539950042  6 NA       NA
 7: <NA>   <NA> -0.928567035  7 NA       NA
 8: <NA>   <NA> -0.294720447  8 NA       NA
 9: <NA>   <NA> -0.005767173  9 NA       NA
10: <NA>   <NA>  2.404653389 10 NA       NA
11: <NA> trough  0.763593461 11 NA       NA
12: <NA>   <NA> -0.799009249 12 NA       NA

Data:

library(data.table)
set.seed(0L)
DT <- data.table(peak=c(NA, "peak", rep(NA, 10)), 
    trough=c(rep(NA, 10), "trough", NA),
    pna=rnorm(12))
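
As a sanity check, the join-based result can be compared with a plain loop on the sample data. This is a sketch that assumes each peak should be paired with the first trough at or after it; the names `peak_rows`, `trough_rows` and `expected` are introduced here for illustration. The loop visits only the peak rows rather than every row.

```r
library(data.table)
set.seed(0L)
DT <- data.table(peak = c(NA, "peak", rep(NA, 10)),
                 trough = c(rep(NA, 10), "trough", NA),
                 pna = rnorm(12))

peak_rows   <- which(!is.na(DT$peak))
trough_rows <- which(!is.na(DT$trough))

# Reference result: for each peak, max of pna up to the next trough
expected <- rep(NA_real_, nrow(DT))
for (p in peak_rows) {
  nxt <- trough_rows[trough_rows >= p]
  if (length(nxt) > 0L) expected[p] <- max(DT$pna[p:nxt[1L]])
}

# Join-based result (rolling join, then non-equi join)
DT[, rn := .I]
DT[!is.na(peak), nt := DT[!is.na(trough)][.SD, on = .(rn), roll = -Inf, x.rn]]
DT[!is.na(peak), max_insp :=
     DT[.SD, on = .(rn >= rn, rn <= nt), by = .EACHI, max(x.pna)]$V1]

# Compare the two results
all.equal(DT$max_insp, expected)
```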
