Optimize R code for row operations on a ternary data frame

Problem

I have this function and I need to make it run faster :)

if (length(vec) == 0) { # first case
  count = sum(apply(df, 1, function(x) {
    all(x == 0, na.rm = T)
  }))
} else if (length(vec) == 1) { # second case
  count = sum(df[, vec], na.rm = T)
} else {
  count = sum(apply(df[, vec], 1, function(x) { # third case
    all(x == 1) }), na.rm = T)
}

df is a data.frame containing only 1, 0, or NA values, and vec is a sub-vector of colnames(df).
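To make the snippet above easy to try, here is a minimal sketch that wraps the three branches in a function and runs it on a tiny data.frame. The name count_rows and the toy data are illustrative assumptions, not part of the original post.

```r
# Hypothetical wrapper around the three branches; count_rows is an
# illustrative name, not something from the original code.
count_rows <- function(df, vec = character(0)) {
  if (length(vec) == 0) {
    # first case: rows in which every non-NA value is 0
    sum(apply(df, 1, function(x) all(x == 0, na.rm = TRUE)))
  } else if (length(vec) == 1) {
    # second case: number of 1s in a single column
    sum(df[, vec], na.rm = TRUE)
  } else {
    # third case: rows in which every selected column is 1
    # (rows containing NA yield FALSE or NA and are not counted)
    sum(apply(df[, vec], 1, function(x) all(x == 1)), na.rm = TRUE)
  }
}

# Toy data, assumed for illustration only
df <- data.frame(
  a = c(0, 1, NA, 1),
  b = c(0, 1, 0, NA),
  c = c(NA, 1, 0, 0)
)

count_rows(df)               # 2 (rows 1 and 3)
count_rows(df, "a")          # 2
count_rows(df, c("a", "b"))  # 1 (row 2)
```

Having the three cases behind one function also makes it straightforward to swap in a faster implementation later and compare the results.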

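Question

Do you think there is any way to make this code run faster, using dplyr or something else, given that it processes the data row by row? For example, when I swapped the simpler one (second case), count = sum(df[, vec], na.rm = T), for the dplyr version, sum(df %>% select(vec), na.rm = T), and benchmarked it, it was considerably worse (but OK, I don't think the second case can get considerably faster with any method).

Any tips or tricks for the 2nd and 3rd cases are welcome!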

Benchmarking

A data.frame big enough to play with:

df = matrix(data = sample(c(0,1,NA), size = 100000, replace = TRUE), nrow = 10000, ncol = 10)

rbenchmark::benchmark(
  "prev" = {
    sum(apply(df, 1, function(x) {all(x == 0, na.rm = T)}))
  },
  "new-long" = {
    sum((rowSums(df == 0, na.rm = TRUE) + rowSums(is.na(df)) == ncol(df)))
  },
  "new-short" = {
    sum(!rowSums(df != 0, na.rm = TRUE))
  },
  replications = 1000,
  columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self")
)

Results:

       test replications elapsed relative user.self sys.self
2  new-long         1000   1.267    1.412     1.267        0
3 new-short         1000   0.897    1.000     0.897        0
1      prev         1000  11.857   13.219    11.859        0

rbenchmark::benchmark(
  "prev" = {
    sum(apply(df[, vec], 1, function(x) { all(x == 1) }), na.rm = T)
  },
  "new" = {
    sum(!rowSums(replace(df[, vec], is.na(df[, vec]), -999) != 1))
  },
  replications = 1000,
  columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self")
)

Results:

  test replications elapsed relative user.self sys.self
2  new         1000   0.179    1.000     0.175    0.004
1 prev         1000   2.219   12.397     2.219    0.000

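Overall, a nice speedup using rowSums! Use it instead of apply, too!

For the first and third cases, here is an option for optimizing the code with rowSums. Because rows containing NA values are an edge case, one option is to replace the NAs with a value that does not occur in the data set, create a logical matrix, use rowSums to reduce it to a logical vector, and take the sum of the TRUE values.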

sum((rowSums(df == 0, na.rm = TRUE) + rowSums(is.na(df)) == ncol(df)))

sum(!rowSums(df != 0, na.rm = TRUE))
sum(!rowSums(replace(df[, vec], is.na(df[, vec]), -999) != 1))
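As a sanity check, the sketch below compares the rowSums expressions above against the original apply versions on a small hand-built data.frame. The toy data and the variable names (first_apply, third_rowsums, and so on) are assumptions made for illustration; the expressions themselves are the ones shown above.

```r
# Toy data, assumed for illustration: only 0, 1 and NA values,
# including one row that is entirely NA.
df <- data.frame(
  a = c(0, 1, NA, 1, NA),
  b = c(0, 1, 0, NA, NA),
  c = c(NA, 1, 0, 0, NA)
)
vec <- c("a", "b")

# First case: rows in which every non-NA value is 0
first_apply <- sum(apply(df, 1, function(x) all(x == 0, na.rm = TRUE)))
first_long  <- sum((rowSums(df == 0, na.rm = TRUE) + rowSums(is.na(df))) == ncol(df))
first_short <- sum(!rowSums(df != 0, na.rm = TRUE))

# Third case: rows in which every selected column is 1.
# NAs are replaced by -999, a value that does not occur in the data,
# so a row containing NA can never count as "all equal to 1".
third_apply   <- sum(apply(df[, vec], 1, function(x) all(x == 1)), na.rm = TRUE)
sub           <- df[, vec]
third_rowsums <- sum(!rowSums(replace(sub, is.na(sub), -999) != 1))

c(first_apply, first_long, first_short)  # 3 3 3
c(third_apply, third_rowsums)            # 1 1
```

The negation !rowSums(...) relies on R coercing a row sum of 0 to FALSE, so a row is counted exactly when none of its cells violates the condition.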