How do I optimize sapply in R to calculate running totals on a dataframe
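I've written a function in R to calculate cumulative totals by month, but the execution time of my method grows exponentially as the dataset gets larger. I'm a novice R programmer; can you help me make this more efficient?
The function and the way I call it: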
accumulate <- function(recordnum,df){
sumthese <- (df$subject == df$subject[recordnum]) &
(df$month <= df$month[recordnum])
sum(df$measurement[sumthese])
}
set.seed(42)
datalength = 10
df <- data.frame(measurement = runif(1:datalength),
subject=rep(c("dog","cat"),each =datalength/2),
month=rep(seq(datalength/2,1,by=-1)))
system.time(df$cumulative <- sapply(1:datalength,accumulate,df))
Input dataframe:
> df
measurement subject month
1 0.4577418 dog 5
2 0.7191123 dog 4
3 0.9346722 dog 3
4 0.2554288 dog 2
5 0.4622928 dog 1
6 0.9400145 cat 5
7 0.9782264 cat 4
8 0.1174874 cat 3
9 0.4749971 cat 2
10 0.5603327 cat 1
Output dataframe:
> df
measurement subject month cumulative
1 0.9148060 dog 5 3.6102141
2 0.9370754 dog 4 2.6954081
3 0.2861395 dog 3 1.7583327
4 0.8304476 dog 2 1.4721931
5 0.6417455 dog 1 0.6417455
6 0.5190959 cat 5 2.7524079
7 0.7365883 cat 4 2.2333120
8 0.1346666 cat 3 1.4967237
9 0.6569923 cat 2 1.3620571
10 0.7050648 cat 1 0.7050648
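Note that the cumulative column shows the accumulation of all measurements up to and including the current month. The function does this without requiring the dataframe to be sorted. When datalength is 100, the elapsed time is 0.3; for 1000 it is 0.58; for 10,000 it is 27.72. I need this to run on 200K+ records.
Thanks!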
Instead of using a custom function, why not use the built-in R functions by and cumsum?
df <- df[order(df$subject,df$month),]
df <- cbind(df,
cumulative=do.call(what=c,
args=by(data=df$measurement,
INDICES=df$subject,
FUN=cumsum)))
print(df)
measurement subject month cumulative
10 0.7050648 cat 1 0.7050648
9 0.6569923 cat 2 1.3620571
8 0.1346666 cat 3 1.4967237
7 0.7365883 cat 4 2.2333120
6 0.5190959 cat 5 2.7524079
5 0.6417455 dog 1 0.6417455
4 0.8304476 dog 2 1.4721931
3 0.2861395 dog 3 1.7583327
2 0.9370754 dog 4 2.6954081
1 0.9148060 dog 5 3.6102141
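cumsum creates the cumulative sums, and by lets you do the processing group-wise (it returns a list; an alternative is aggregate, which gives you a data frame). As long as the data is sorted correctly, this gives you the correct result.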
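To make the sorting caveat concrete, here is a minimal sketch on a made-up data frame (the name toy and its values are hypothetical, not the asker's data). It uses tapply, a close base R relative of by, which is a substitution on my part rather than the answer's exact code, and compares the grouped cumsum before and after ordering by month.

```r
# Hypothetical toy data: rows within each subject are NOT in month order
toy <- data.frame(measurement = c(10, 1, 100, 5, 50),
                  subject     = c("a", "a", "a", "b", "b"),
                  month       = c(3, 1, 2, 2, 1))

# Grouped cumsum in the existing row order: a running total over rows,
# not over months
unsorted <- unlist(tapply(toy$measurement, toy$subject, cumsum),
                   use.names = FALSE)

# Order by subject, then month, and repeat: now each value is the total
# of all measurements up to and including that row's month
sorted_toy <- toy[order(toy$subject, toy$month), ]
sorted_toy$cumulative <- unlist(tapply(sorted_toy$measurement,
                                       sorted_toy$subject, cumsum),
                                use.names = FALSE)

print(unsorted)
print(sorted_toy)
```

Only the second result matches the "up to and including the current month" definition from the question, which is why the sort step comes first.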
dplyr makes this very simple
library(dplyr)
df %>%
group_by(subject) %>%
arrange(month) %>%
mutate(cumulative = cumsum(measurement))
Source: local data frame [10 x 4]
Groups: subject
measurement subject month cumulative
1 0.7050648 cat 1 0.7050648
2 0.6569923 cat 2 1.3620571
3 0.1346666 cat 3 1.4967237
4 0.7365883 cat 4 2.2333120
5 0.5190959 cat 5 2.7524079
6 0.6417455 dog 1 0.6417455
7 0.8304476 dog 2 1.4721931
8 0.2861395 dog 3 1.7583327
9 0.9370754 dog 4 2.6954081
10 0.9148060 dog 5 3.6102141
Though if you are looking for absolute performance, you might want to use data.table
library(data.table)
setDT(df)[order(month), cumulative := cumsum(measurement), by=subject]
# measurement subject month cumulative
# 1: 0.7050648 cat 1 0.7050648
# 2: 0.6569923 cat 2 1.3620571
# 3: 0.1346666 cat 3 1.4967237
# 4: 0.7365883 cat 4 2.2333120
# 5: 0.5190959 cat 5 2.7524079
# 6: 0.6417455 dog 1 0.6417455
# 7: 0.8304476 dog 2 1.4721931
# 8: 0.2861395 dog 3 1.7583327
# 9: 0.9370754 dog 4 2.6954081
# 10: 0.9148060 dog 5 3.6102141
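This is non-destructive, i.e. the original df is not modified. It uses no packages. The original order of the rows of df is preserved; however, if that is not important, the [order(o), ] at the end of the last line can be omitted.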
o <- order(df$subject, df$month)
transform(df[o, ], cumulative = ave(measurement, subject, FUN = cumsum))[order(o), ]
giving:
measurement subject month cumulative
1 0.37955924 dog 5 2.2580530
2 0.43577158 dog 4 1.8784938
3 0.03743103 dog 3 1.4427222
4 0.97353991 dog 2 1.4052912
5 0.43175125 dog 1 0.4317512
6 0.95757660 cat 5 4.0751151
7 0.88775491 cat 4 3.1175385
8 0.63997877 cat 3 2.2297836
9 0.97096661 cat 2 1.5898048
10 0.61883821 cat 1 0.6188382
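One way to gain confidence in a vectorized rewrite like this is to compare it with the original row-by-row function on a small random sample. The sketch below restates accumulate from the question, builds a hypothetical test frame check_df with shuffled months, and checks that the sort-plus-ave approach produces the same column. It assumes each subject has at most one row per month, as in the question's data; with repeated months the two definitions would differ, because cumsum gives partial totals within a tied month.

```r
# Row-by-row definition from the question
accumulate <- function(recordnum, df) {
  sumthese <- (df$subject == df$subject[recordnum]) &
              (df$month <= df$month[recordnum])
  sum(df$measurement[sumthese])
}

set.seed(1)
n <- 20   # rows per subject (arbitrary choice for the check)
check_df <- data.frame(measurement = runif(2 * n),
                       subject     = rep(c("dog", "cat"), each = n),
                       # shuffled months, unique within each subject
                       month       = c(sample(n), sample(n)))

# Slow reference result: one full scan of the data per row
slow <- sapply(seq_len(nrow(check_df)), accumulate, check_df)

# Vectorized version: sort, cumsum within subject, restore row order
o <- order(check_df$subject, check_df$month)
fast <- ave(check_df$measurement[o], check_df$subject[o],
            FUN = cumsum)[order(o)]

# Compare with a numeric tolerance, since the additions happen
# in a different order
print(all.equal(slow, fast))
```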
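This function takes vectors of measurements and months, works out how to order the data by month, computes the cumulative sum of the month-ordered measurements, and returns the result in the original order (using the fact that (x[o])[order(o)] == x)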
FUN <- function(measure, month) {
o <- order(month)
cumsum(measure[o])[order(o)]
}
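The reordering identity used in FUN can be checked in isolation. The sketch below uses made-up vectors (x and month here are illustrative values only) to show that indexing with o sorts by month, and indexing the result with order(o) undoes that sort.

```r
# Illustrative values only
x     <- c(10, 20, 30, 40)
month <- c(3, 1, 4, 2)

o        <- order(month)       # permutation that puts x in month order
sorted   <- x[o]               # x arranged by month
restored <- sorted[order(o)]   # order(o) is the inverse permutation

print(identical(restored, x))

# Same idea with cumsum in between: each element of the result is the
# running total up to that element's month, reported at its original position
running <- cumsum(x[o])[order(o)]
print(running)
```

Because order always returns a permutation, the identity holds even when some months are tied.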
So if you split the measurements and the months into lists by subject, you can map each element from its old values to the new ones
Map(FUN, split(df$measurement, df$subject), split(df$month, df$subject))
Provided the implied 'geometry' is consistent, split()<- does the bookkeeping needed to assign the list of values to their correct positions in the vector
df$cumulative <- NA_real_ # or add this column to df's construction
split(df$cumulative, df$subject) <-
Map(FUN, split(df$measurement, df$subject), split(df$month, df$subject))
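For readers unfamiliar with the replacement form of split, here is a small sketch on made-up vectors (vals, grp, and out are hypothetical names). It shows that assigning a list through split()<- puts each group's values back at the positions that group occupies in the original vector, which relies on each list element having the same length as its group.

```r
vals <- c(1, 2, 3, 4, 5)
grp  <- c("dog", "cat", "dog", "cat", "dog")

# split() gathers the values of each group, keeping their relative order
pieces <- split(vals, grp)

# Transform each piece without changing its length
totals <- lapply(pieces, cumsum)

# split()<- writes each transformed piece back to its group's positions
out <- rep(NA_real_, length(vals))
split(out, grp) <- totals
print(out)
```

The dog totals land in positions 1, 3, and 5 and the cat totals in positions 2 and 4, so the original row order is untouched.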
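I think this is the only solution so far that preserves the original order of the data (presumably the necessary steps could be added to the other solutions...)
It appears to scale linearly, at least as the number of rows gets large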
f0 <- function(df) {
split(df$cumulative, df$subject) <-
Map(FUN, split(df$measurement, df$subject), split(df$month, df$subject))
df
}
df <- lapply(10^(3:6), function(datalength) {
data.frame(measurement = runif(1:datalength),
subject=rep(c("dog","cat"),each =datalength/2),
month=rep(seq(datalength/2,1,by=-1)),
cumulative=rep(NA_real_, datalength))
})
library(microbenchmark)
and then
> microbenchmark(f0(df[[1]]), f0(df[[2]]), f0(df[[3]]), f0(df[[4]]))
Unit: microseconds
        expr        min          lq        mean      median          uq        max neval
 f0(df[[1]])    503.076    523.5275    576.4077    574.7825    612.9585    913.945   100
 f0(df[[2]])   2701.103   2769.3830   2869.0045   2847.1190   2922.0120   3296.594   100
 f0(df[[3]])  26673.878  27184.7980  27894.5087  27547.5595  28595.6775  35015.903   100
 f0(df[[4]]) 283416.456 285104.5225 292142.5274 290043.3785 295415.6995 342556.407   100