How do I optimize sapply in R to calculate running totals on a data frame?

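I wrote a function in R to calculate cumulative totals by month, but the execution time of my method grows exponentially as the dataset gets larger. I'm a novice R programmer; can you help me make this more efficient?
The function and the way I call it: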

# For one row, sum every measurement of the same subject up to and
# including that row's month. Each call scans the whole data frame.
accumulate <- function(recordnum, df) {
    sumthese <- (df$subject == df$subject[recordnum]) &
        (df$month <= df$month[recordnum])
    sum(df$measurement[sumthese])
}
set.seed(42)
datalength <- 10
df <- data.frame(measurement = runif(datalength),
                 subject = rep(c("dog", "cat"), each = datalength / 2),
                 month = rep(seq(datalength / 2, 1, by = -1), times = 2))
system.time(df$cumulative <- sapply(1:datalength, accumulate, df))

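Input data frame (apparently printed from a different random draw, so the measurement values differ from the output below):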

> df
   measurement subject month
1    0.4577418     dog     5
2    0.7191123     dog     4
3    0.9346722     dog     3
4    0.2554288     dog     2
5    0.4622928     dog     1
6    0.9400145     cat     5
7    0.9782264     cat     4
8    0.1174874     cat     3
9    0.4749971     cat     2
10   0.5603327     cat     1

Output data frame:

> df
   measurement subject month cumulative
1    0.9148060     dog     5  3.6102141
2    0.9370754     dog     4  2.6954081
3    0.2861395     dog     3  1.7583327
4    0.8304476     dog     2  1.4721931
5    0.6417455     dog     1  0.6417455
6    0.5190959     cat     5  2.7524079
7    0.7365883     cat     4  2.2333120
8    0.1346666     cat     3  1.4967237
9    0.6569923     cat     2  1.3620571
10   0.7050648     cat     1  0.7050648

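Note that the cumulative column shows the running total of all measurements up to and including the current month. The function does not require the data frame to be sorted. With datalength = 100 the elapsed time is 0.3; with 1,000 it is 0.58; with 10,000 it is 27.72. I need this to run on 200K+ records.
Thanks!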

Rather than a custom function, use the built-in R functions by and cumsum:

df <- df[order(df$subject, df$month), ]
df <- cbind(df,
            cumulative = do.call(what = c,
                                 args = by(data = df$measurement,
                                           INDICES = df$subject,
                                           FUN = cumsum)))
print(df)

   measurement subject month cumulative
10   0.7050648     cat     1  0.7050648
9    0.6569923     cat     2  1.3620571
8    0.1346666     cat     3  1.4967237
7    0.7365883     cat     4  2.2333120
6    0.5190959     cat     5  2.7524079
5    0.6417455     dog     1  0.6417455
4    0.8304476     dog     2  1.4721931
3    0.2861395     dog     3  1.7583327
2    0.9370754     dog     4  2.6954081
1    0.9148060     dog     5  3.6102141

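cumsum computes the cumulative sum, and by lets you process the data group by group (it returns a list; another option is aggregate, which gives you a data frame). As long as the data is sorted correctly, this gives you the right result.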
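As a sanity check, here is a minimal sketch that compares the question's row-by-row approach with a grouped cumsum on the same sorted data. It uses split() and lapply() as a stand-in for by(), and the names sorted, slow and fast are only for illustration.

```r
# Original row-by-row approach from the question.
accumulate <- function(recordnum, df) {
    sumthese <- (df$subject == df$subject[recordnum]) &
        (df$month <= df$month[recordnum])
    sum(df$measurement[sumthese])
}

set.seed(42)
datalength <- 10
df <- data.frame(measurement = runif(datalength),
                 subject = rep(c("dog", "cat"), each = datalength / 2),
                 month = rep(seq(datalength / 2, 1, by = -1), times = 2))

# Sort once, then compute both versions on the same sorted rows.
sorted <- df[order(df$subject, df$month), ]
slow <- sapply(seq_len(nrow(sorted)), accumulate, sorted)
fast <- unlist(lapply(split(sorted$measurement, sorted$subject), cumsum),
               use.names = FALSE)

# TRUE if the two approaches agree up to floating-point tolerance.
all.equal(slow, fast)
```

The grouped version touches each row once after the sort, whereas accumulate() scans the whole data frame for every row, which is why the original slows down so sharply as the data grows.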

dplyr makes this very simple:

library(dplyr)
df %>%
    arrange(subject, month) %>%   # sort explicitly; current dplyr's arrange() ignores grouping by default
    group_by(subject) %>%
    mutate(cumulative = cumsum(measurement))

Source: local data frame [10 x 4]
Groups: subject

   measurement subject month cumulative
1    0.7050648     cat     1  0.7050648
2    0.6569923     cat     2  1.3620571
3    0.1346666     cat     3  1.4967237
4    0.7365883     cat     4  2.2333120
5    0.5190959     cat     5  2.7524079
6    0.6417455     dog     1  0.6417455
7    0.8304476     dog     2  1.4721931
8    0.2861395     dog     3  1.7583327
9    0.9370754     dog     4  2.6954081
10   0.9148060     dog     5  3.6102141

Though if you're after absolute performance, you probably want to use data.table:

library(data.table)
setDT(df)[order(month), cumulative := cumsum(measurement), by=subject]    

#     measurement subject month cumulative
#  1:   0.7050648     cat     1  0.7050648
#  2:   0.6569923     cat     2  1.3620571
#  3:   0.1346666     cat     3  1.4967237
#  4:   0.7365883     cat     4  2.2333120
#  5:   0.5190959     cat     5  2.7524079
#  6:   0.6417455     dog     1  0.6417455
#  7:   0.8304476     dog     2  1.4721931
#  8:   0.2861395     dog     3  1.7583327
#  9:   0.9370754     dog     4  2.6954081
# 10:   0.9148060     dog     5  3.6102141

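This is non-destructive, i.e. the original df is not modified, and it uses no packages. The original row order of df is preserved; if that doesn't matter, the trailing [order(o), ] on the last line can be omitted.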

o <- order(df$subject, df$month)
transform(df[o, ], cumulative = ave(measurement, subject, FUN = cumsum))[order(o), ]

giving:

   measurement subject month cumulative
1   0.37955924     dog     5  2.2580530
2   0.43577158     dog     4  1.8784938
3   0.03743103     dog     3  1.4427222
4   0.97353991     dog     2  1.4052912
5   0.43175125     dog     1  0.4317512
6   0.95757660     cat     5  4.0751151
7   0.88775491     cat     4  3.1175385
8   0.63997877     cat     3  2.2297836
9   0.97096661     cat     2  1.5898048
10  0.61883821     cat     1  0.6188382

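This function takes the measurement and month vectors, works out the ordering that sorts the data by month, computes the cumulative sum of the measurements in month order, and returns the result in the original order (using the identity (x[o])[order(o)] == x):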

FUN <- function(measure, month) {
    o <- order(month)
    cumsum(measure[o])[order(o)]
}
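To see why indexing by order(o) restores the original order, here is a small worked example. The input values are made up for illustration, and FUN is repeated so the snippet runs on its own.

```r
x <- c(30, 10, 20)             # made-up values
o <- order(x)                  # permutation that sorts x: 2 3 1
sorted <- x[o]                 # 10 20 30
restored <- sorted[order(o)]   # order(o) is the inverse permutation
identical(restored, x)         # TRUE

# The same trick applied to a running total.
FUN <- function(measure, month) {
    o <- order(month)
    cumsum(measure[o])[order(o)]
}
running <- FUN(measure = c(5, 1, 2), month = c(3, 1, 2))
running                        # 8 1 3
```

The month-3 row gets 8 (all three measurements), the month-1 row gets 1, and the month-2 row gets 3, each reported at its original position.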

So if you split the measurements and months into lists by subject, you can map each element from its old values to its new values:

Map(FUN, split(df$measurement, df$subject), split(df$month, df$subject))

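Provided the implied 'geometry' is consistent, split()<- does the bookkeeping needed to assign the list of values back to their correct positions in the vector: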

df$cumulative <- NA_real_   # or add this column to df's construction
split(df$cumulative, df$subject) <- 
    Map(FUN, split(df$measurement, df$subject), split(df$month, df$subject))
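A toy illustration of the split()<- bookkeeping, with made-up values and group labels. Each group's results land back in the positions that group occupied in the original vector.

```r
values <- c(1, 2, 3, 4, 5, 6)               # made-up data
group  <- c("a", "b", "a", "b", "a", "b")   # interleaved groups

result <- rep(NA_real_, length(values))

# split(values, group) gives a = 1 3 5 and b = 2 4 6; after cumsum each
# piece is written back to the slots its group came from.
split(result, group) <- lapply(split(values, group), cumsum)
result                                      # 1 2 4 6 9 12
```

The replacement list is matched to the groups by position, so it should come from a split() on the same grouping vector, as it does here and in the code above.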

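I think this is, so far, the only solution that preserves the original order of the data (presumably steps could be added to the other solutions to do the same...)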

This seems to scale linearly, at least as the number of rows gets large:

f0 <- function(df) {
    split(df$cumulative, df$subject) <- 
        Map(FUN, split(df$measurement, df$subject), split(df$month, df$subject))
    df
}

df <- lapply(10^(3:6), function(datalength) {
    data.frame(measurement = runif(1:datalength),
               subject=rep(c("dog","cat"),each =datalength/2),
               month=rep(seq(datalength/2,1,by=-1)),
               cumulative=rep(NA_real_, datalength))
})

library(microbenchmark)

and then

> microbenchmark(f0(df[[1]]), f0(df[[2]]), f0(df[[3]]), f0(df[[4]]))
Unit: microseconds
        expr        min          lq        mean      median          uq        max neval
 f0(df[[1]])    503.076    523.5275    576.4077    574.7825    612.9585    913.945   100
 f0(df[[2]])   2701.103   2769.3830   2869.0045   2847.1190   2922.0120   3296.594   100
 f0(df[[3]])  26673.878  27184.7980  27894.5087  27547.5595  28595.6775  35015.903   100
 f0(df[[4]]) 283416.456 285104.5225 292142.5274 290043.3785 295415.6995 342556.407   100
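One rough way to read the linear-scaling claim off these numbers is to divide each median time by the previous one. The sketch below just copies the median column from the output above; the variable names are only for illustration.

```r
rows      <- 10^(3:6)
median_us <- c(574.7825, 2847.1190, 27547.5595, 290043.3785)  # medians above

# Growth in median time for each tenfold increase in rows.
growth <- median_us[-1] / median_us[-length(median_us)]
round(growth, 1)

# Time per row in microseconds; roughly flat if scaling is linear.
per_row <- median_us / rows
round(per_row, 3)
```

The two larger steps each cost close to 10 times more for 10 times the rows, which is what linear scaling looks like; the first step grows by less, presumably because fixed overhead dominates at 1,000 rows.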