For Loops: How to avoid loop when index is used to call multiple different column values?

Question

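I have a large medical-records dataset (20 million rows). I want to create a new column that is filled, for every row in a group, with a value taken from one specific row of that group.

What the data looks like

The data looks like this: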

data <- data.frame(
        ICUSTAY_ID = c(1,1,5,5,5,5,5,6,6,6,6),
        DATA = c(0,0,0,0,1,0,0,0,0,1,0), 
        OFFSET = c(-20,0,-1500, 150, 155, 159, 300, -2000, 30, 100, 120),
        AA_FIRST = c(NA, NA, NA, NA, 1, NA, NA, NA,NA,1,NA), 
        LABRESULT = c(4, 5, 3.5, 4.1, NA, 3.0, 5.5, 2.1, 2.5, NA, 3.5) )


          ID DATA OFFSET AA_FIRST LABRESULT
          1    0    -20       NA       4.0
          1    0      0       NA       5.0
          5    0  -1500       NA       3.5
          5    0    150       NA       4.1
          5    1    155        1        NA
          5    0    159       NA       3.0
          5    0    300       NA       5.5
          6    0  -2000       NA       2.1
          6    0     30       NA       2.5
          6    1    100        1        NA
          6    0    120       NA       3.5

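What I want the data to look like

For each ID group (the ICUSTAY_ID column in the data frame above), I want to find the row in that group where AA_FIRST=1 (there is at most one per ID group), take that row's OFFSET value, and put it into a new column called refOFFSET for all rows of that ID. I want the result to look like this: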

          ID DATA OFFSET AA_FIRST LABRESULT refOFFSET
          1    0    -20       NA       4.0        NA
          1    0      0       NA       5.0        NA
          5    0  -1500       NA       3.5       155
          5    0    150       NA       4.1       155
          5    1    155        1        NA       155
          5    0    159       NA       3.0       155
          5    0    300       NA       5.5       155
          6    0  -2000       NA       2.1       100
          6    0     30       NA       2.5       100
          6    1    100        1        NA       100
          6    0    120       NA       3.5       100

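For ID group 5, AA_FIRST=1 corresponds to OFFSET 155, so the refOFFSET column is filled with 155 for all rows with ID=5.

For ID group 6, AA_FIRST=1 corresponds to OFFSET 100, so the refOFFSET column is filled with 100 for all rows with ID=6.

An ID group does not necessarily have a row with AA_FIRST=1. This is the case for ID group 1: it has no AA_FIRST=1, so its refOFFSET is NA.

Not every ID value is present. For example, IDs 2, 3 and 4 do not exist.

My current approach

My current code for this consists of a for loop and if/else statements. I would like a vectorized or apply-style version, because my for loop takes far too long on 20 million rows.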

data$refOFFSET <- NA  # initialize column called refOFFSET

for (i in seq_len(nrow(data))) {
        # rows in the same ID group as row i where AA_FIRST == 1
        ref_idx <- which(data$ICUSTAY_ID == data$ICUSTAY_ID[i] & data$AA_FIRST == 1)
        if (!length(ref_idx)) {  # integer(0): this group has no AA_FIRST == 1 row
                next  # leave refOFFSET as NA and go on to the next i
        }
        # reference OFFSET for this ID, i.e. the OFFSET where AA_FIRST == 1
        data$refOFFSET[i] <- data$OFFSET[ref_idx]
}

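Question

Does anyone know how to write the code above in a vectorized or apply form? Is there anything that would speed up the computation? Thanks!

Edit: My reproducible example data and the raw data I displayed differed slightly. I have corrected this.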

Answer 1

We can try dplyr. Grouping the data frame by ICUSTAY_ID, we find the first index where AA_FIRST=1 and use the corresponding OFFSET value for the whole group.

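library(dplyr)
data %>%
  group_by(ICUSTAY_ID) %>%
  # which.min() skips NAs, so this picks the first row where AA_FIRST == 1;
  # for a group with no such row it returns integer(0)
  mutate(refOFFSET = OFFSET[which.min(AA_FIRST == 1)])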

#   ICUSTAY_ID  DATA OFFSET AA_FIRST LABRESULT refOFFSET
#    <dbl> <dbl>  <dbl>    <dbl>     <dbl>     <dbl>
#1      1     0    -20       NA       4.0        NA
#2      1     0      0       NA       5.0        NA
#3      5     0  -1500       NA       3.5       155
#4      5     0    150       NA       4.1       155
#5      5     1    155        1        NA       155
#6      5     0    159       NA       3.0       155
#7      5     0    300       NA       5.5       155
#8      6     0  -2000       NA       2.1       100
#9      6     0     30       NA       2.5       100
#10     6     1    100        1        NA       100
#11     6     0    120       NA       3.5       100

Answer 2

You can create your own function and use apply like this:

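# Row-wise helper for apply(): each row arrives as a vector,
# so position 3 is OFFSET and position 4 is AA_FIRST.
my_function <- function(input_vector) {
    if (is.na(input_vector[4])) { return(NA) }
    if (input_vector[4] == 1) {
        return(input_vector[3])
    } else { return(NA) }
}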

data<- data.frame(
        ID = c(5,5,5,5,5,6,6,6,6),
        DATA = c(0,0,1,0,0,0,0,1,0), 
        OFFSET = c(-1500, 150, 155, 159, 300, -2000, 30, 100, 120), 
        AA_FIRST = c(NA, NA, 1, NA, NA, NA,NA,1,NA), 
        LABRESULT = c(3.5, 4.1, NA, 3.0, 5.5, 2.1, 2.5, NA, 3.5) )

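# Step 1: flag the reference rows (refOFFSET is set only where AA_FIRST == 1)
ref_col <- apply(data, 1, my_function)
data[, 'refOFFSET'] <- ref_col

# Step 2: copy each reference value to every row with the same ID
refOFFSET_val_idx <- which(!is.na(ref_col))
refOFFSET_lookup_df <- data[refOFFSET_val_idx, c('ID', 'refOFFSET')]
for (i in seq_len(nrow(refOFFSET_lookup_df))) {
    ID_to_change_idx <- which(data$ID == refOFFSET_lookup_df[i, 'ID'])
    data[ID_to_change_idx, 'refOFFSET'] <- refOFFSET_lookup_df[i, 'refOFFSET']
}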
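
The two steps above (find the reference row of each ID, then copy its OFFSET to every row of that ID) can also be sketched in base R without an explicit loop by using ave(). This is a swapped-in technique, not the code from this answer, and the small data frame `df` below is an illustrative, trimmed copy of the question's example rather than the real data.

```r
# Illustrative data: ID group 1 has no AA_FIRST == 1 row
df <- data.frame(
  ID       = c(1, 1, 5, 5, 5, 6, 6),
  OFFSET   = c(-20, 0, -1500, 155, 159, 100, 120),
  AA_FIRST = c(NA, NA, NA, 1, NA, 1, NA)
)

# ave() splits the row indices by ID, applies the function to each group,
# and writes the (recycled) result back in the original row order.
df$refOFFSET <- ave(
  seq_len(nrow(df)), df$ID,
  FUN = function(rows) {
    # match() returns NA when the group has no AA_FIRST == 1 row,
    # so indexing OFFSET with it gives NA for that group.
    df$OFFSET[rows][match(1, df$AA_FIRST[rows])]
  }
)

print(df)
```

Whether this is fast enough for 20 million rows is not something the sketch establishes; it only shows the grouping logic without a row-by-row loop.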

Answer 3

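Here is an option using data.table. Convert the 'data.frame' to a 'data.table' (setDT(data)); grouped by 'ICUSTAY_ID', we get the index of the value 1 in 'AA_FIRST', take the corresponding 'OFFSET' value, and assign (:=) it to create 'refOFFSET'. This should be very efficient because := assigns in place.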

library(data.table)
setDT(data)[, refOFFSET := OFFSET[match(1, AA_FIRST)], by = ICUSTAY_ID]
data
#    ICUSTAY_ID DATA OFFSET AA_FIRST LABRESULT refOFFSET
# 1:          1    0    -20       NA       4.0        NA
# 2:          1    0      0       NA       5.0        NA
# 3:          5    0  -1500       NA       3.5       155
# 4:          5    0    150       NA       4.1       155
# 5:          5    1    155        1        NA       155
# 6:          5    0    159       NA       3.0       155
# 7:          5    0    300       NA       5.5       155
# 8:          6    0  -2000       NA       2.1       100
# 9:          6    0     30       NA       2.5       100
#10:          6    1    100        1        NA       100
#11:          6    0    120       NA       3.5       100

Answer 4

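You can also subset the rows where AA_FIRST == 1, save them as a lookup table (similar to a dictionary in Python), and then look up the value for each row by matching on ID.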

data<- data.frame(
  ID = c(1,1,5,5,5,5,5,6,6,6,6),
  DATA = c(0,0,0,0,1,0,0,0,0,1,0), 
  OFFSET = c(-20,0,-1500, 150, 155, 159, 300, -2000, 30, 100, 120), 
  AA_FIRST = c(NA, NA, NA, NA, 1, NA, NA, NA, NA, 1, NA), 
  LABRESULT = c(4.0, 5.0, 3.5, 4.1, NA, 3.0, 5.5, 2.1, 2.5, NA, 3.5) )

dict <- subset(data, data$AA_FIRST==1)[c("ID", "OFFSET")]

data$refOFFSET <- dict[match(data$ID, dict$ID), 2]
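
A minimal sketch of the same lookup idea, assuming a named vector is used as the "dictionary" instead of a two-column data frame. The names `df`, `hits` and `lookup` are hypothetical, and the data is a trimmed copy of the example.

```r
# Illustrative data: ID group 1 has no AA_FIRST == 1 row
df <- data.frame(
  ID       = c(1, 1, 5, 5, 5, 6, 6),
  OFFSET   = c(-20, 0, -1500, 155, 159, 100, 120),
  AA_FIRST = c(NA, NA, NA, 1, NA, 1, NA)
)

# which() drops the NA comparisons, leaving only the reference rows
hits <- df[which(df$AA_FIRST == 1), ]

# Named vector keyed by ID (names are stored as character strings)
lookup <- setNames(hits$OFFSET, hits$ID)

# Index by name; IDs missing from the lookup come back as NA
df$refOFFSET <- unname(lookup[as.character(df$ID)])

print(df)
```

Both variants do a single pass to build the table and a single vectorized lookup, so no per-row loop is involved.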

Answer 5

Using match to find the index seems to fix the incompatible size error you mentioned with Ronak Shah's solution (the which.min approach).

  library(dplyr)
  data %>%
    group_by(ID) %>%  # use ICUSTAY_ID if that is the grouping column name in your data
    mutate(refOFFSET = OFFSET[match(TRUE, AA_FIRST==1)])

"For a logical vector x with both FALSE and TRUE values, which.min(x) and which.max(x) return the index of the first FALSE or TRUE, respectively, as FALSE < TRUE. However, match(FALSE, x) or match(TRUE, x) are typically preferred, as they do indicate mismatches."

http://stat.ethz.ch/R-manual/R-devel/library/base/html/which.min.html
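
The difference described in the quoted documentation can be seen on small vectors in base R. This is only an illustrative sketch; the vectors `no_hit`, `one_hit` and `offsets` are made up to mimic one ID group each.

```r
# Logical vectors like those produced by AA_FIRST == 1 within one ID group
no_hit  <- c(NA, NA) == 1            # group with no AA_FIRST == 1 row
one_hit <- c(NA, NA, 1, NA) == 1     # group whose third row has AA_FIRST == 1
offsets <- c(-20, 0)                 # OFFSET values of the no-hit group

idx_which_min <- which.min(no_hit)   # integer(0): NAs are ignored, nothing is left
idx_match     <- match(TRUE, no_hit) # NA: an explicit "no match" marker

length(offsets[idx_which_min])       # 0, a zero-length result
offsets[idx_match]                   # NA, a length-1 result that can be recycled

which.min(one_hit)                   # 3
match(TRUE, one_hit)                 # 3
```

When a matching row exists, both calls agree; they differ only for groups with no match, which is where a zero-length value cannot be recycled to the size of the group.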

Answer 6

You should be able to do this by filtering and then merging the result back into the data frame.

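    # one (ID, OFFSET) row for each group that has a non-NA AA_FIRST
    foo <- data[!is.na(data$AA_FIRST), c('ID', 'OFFSET')]
    colnames(foo) <- c("ID", "refOFFSET")
    # merge() takes `by`, not `on`; all.x = TRUE keeps IDs without a match (refOFFSET = NA)
    result <- merge(data, foo, by = "ID", all.x = TRUE)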
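
One detail worth checking with this approach: merge() performs an inner join by default, so ID groups with no AA_FIRST=1 row are dropped unless all.x = TRUE is passed. A small sketch on made-up data (`df`, `inner` and `left` are hypothetical names) shows the difference in row counts.

```r
# Illustrative data: ID group 1 has no AA_FIRST == 1 row
df <- data.frame(
  ID       = c(1, 1, 5, 5, 6),
  OFFSET   = c(-20, 0, 150, 155, 100),
  AA_FIRST = c(NA, NA, NA, 1, 1)
)

foo <- df[!is.na(df$AA_FIRST), c("ID", "OFFSET")]
colnames(foo) <- c("ID", "refOFFSET")

inner <- merge(df, foo, by = "ID")                # default: inner join
left  <- merge(df, foo, by = "ID", all.x = TRUE)  # left join keeps every row of df

nrow(inner)  # 3: the two rows of ID 1 are gone
nrow(left)   # 5: ID 1 is kept, with refOFFSET = NA
```

Note that merge() also sorts the result by the key column, so the original row order is only preserved if the data was already sorted by ID.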
