R lapply 向量子集的循环性能

Question

大家好！我很少问新问题，因为这个论坛上已经说了很多，但这次我找不到足够的 material 来解决我的性能问题。

我基本上有调查数据，我想从中计算品牌级别的各种指标。诀窍是我需要通过排除与测试的第 i 个元素相关的所有元素来为循环的每个元素创建向量的变体。目前我还没有找到一种方法来向量化我的代码。因此，我的 lapply 循环非常慢（到目前为止，是更大脚本中最慢的部分）。

我的数据集有 800 万行，我循环了 70 多个品牌，所以此时性能开始变得重要。请参阅您自己的测试的较短的可重现示例：

（编辑：添加到脚本中的注释以便更好地理解。）

# Small sample size to experiment
N <- 200L 

# Table with survey data :
# - each observation is the answer of a person about a brand
# - each observation is associated to a weight, used to compute statistics (frequencies, means...)
# - each person is the described by few socio-demographic variables (country, gender, age)
# - brands are given a grade ('score' variable), ranging from 0 to 10
repex_DT <- data.table (
  country = factor(sample(c("COUNTRY 1", "COUNTRY 2", "COUNTRY 3", "COUNTRY 4"), size = N, replace=TRUE)),
  gender = factor(sample(c("Male", "Female"), size = N, replace=TRUE)),
  age_group = factor(sample(c("Less than 35", "35 and more"), size = N, replace=TRUE)),
  brand = factor(sample(c(toupper(letters[1:26])), size = N, replace=TRUE)),
  score = sample(x = c(0:10), size = N, replace=TRUE),
  weight = sample(x = c(2/(1:10)), size = N, replace=TRUE)
)

# The loop computes for each "country x gender x age_group x brand" combination :
# - the "country x gender x age_group" socio-demographic group size (cases_total, i.e. all brands included)
# - the "country x gender x age_group" group size, excluding the people assessing the current 'brand_' argument
# - Same logic for mean and standard deviation indicators

current_loop <- data.table::rbindlist(l=lapply(unique(repex_DT$brand), function(brand_){
  
  # Calculations done for each 'brand_' of each "country x gender x age_group" combination
  out <- repex_DT[ , .(
    cases_total = sum(x=weight, na.rm=TRUE),
    cases_others = sum(x=weight[brand != brand_], na.rm=TRUE),
    mean_others = expss::w_mean(x=score[brand != brand_], weight=weight[brand != brand_], na.rm=TRUE),
    sd_others  = expss::w_sd(x=score[brand != brand_], weight=weight[brand != brand_], na.rm=TRUE)
  ), by = .(country, gender, age_group)]
  
  out[, brand := brand_]
  data.table::setcolorder(x=out, neworder="brand")
  return(data.table::copy(x=out))})) %>% 
  
  # Sorting at the end for better readability
  .[order(., country, gender, age_group, brand)]

到目前为止，我已经阅读了很多其他 SO 问题，例如 , this other one 和其他关于同一主题的问题，因此我很清楚扩展 data.table 的循环既耗费内存又耗时。然而，我还没有找到其他方法让我到达我想要的地方。希望你们 R 专家可以:-) 顺便说一下，我使用 expss 来计算加权均值和标准差，因为我也使用这个包在这里和那里计算表格，但如果这有助于提高性能，我当然可以使用其他包。

Answer 1

虽然没有解决矢量化问题，但使用 collapse 包可以更有效并带来不错的加速（YMMV，取决于可用内核的数量）：

invisible(suppressPackageStartupMessages(
    lapply(c("magrittr", "expss", "data.table", "collapse"), require, character.only=TRUE)))
options(datatable.optimize = 3L)
N <- 1E7
repex_DT <- data.table (
  country = factor(sample(c("COUNTRY 1", "COUNTRY 2", "COUNTRY 3", "COUNTRY 4"), size = N, replace=TRUE)),
  gender = factor(sample(c("Male", "Female"), size = N, replace=TRUE)),
  age_group = factor(sample(c("Less than 35", "35 and more"), size = N, replace=TRUE)),
  brand = factor(sample(LETTERS, size = N, replace=TRUE)),
  score = sample(x = c(0:10), size = N, replace=TRUE),
  weight = sample(x = c(2/(1:10)), size = N, replace=TRUE)
)

# your version
system.time({
    current_loop <- data.table::rbindlist(l=lapply(unique(repex_DT$brand), function(brand_){
  # Calculations done for each 'brand_' of each "country x gender x age_group" combination
  out <- repex_DT[ , .(
    cases_total = sum(x=weight, na.rm=TRUE),
    cases_others = sum(x=weight[brand != brand_], na.rm=TRUE),
    mean_others = expss::w_mean(x=score[brand != brand_], weight=weight[brand != brand_], na.rm=TRUE),
    sd_others  = expss::w_sd(x=score[brand != brand_], weight=weight[brand != brand_], na.rm=TRUE)
  ), by = .(country, gender, age_group)]
  
  out[, brand := brand_]
  data.table::setcolorder(x=out, neworder="brand")
  return(data.table::copy(x=out))})) %>% 
  
  # Sorting at the end for better readability
  .[order(., country, gender, age_group, brand)]
})
#>    user  system elapsed 
#>  95.612   1.557  49.309

# version using collapse
system.time({
cols <- c("country", "gender", "age_group")
ot <- repex_DT[ , .(cases_total = sum(weight, na.rm=TRUE)), by = cols]
ot2 <- data.table::setorder(rbindlist(lapply(setNames(levels(repex_DT$brand), levels(repex_DT$brand)), 
    \(i) repex_DT[brand != i][, .(cases_others = fsum(x=weight, na.rm=TRUE),
    mean_others = fmean(score, w=weight, na.rm=TRUE),
    sd_others = fsd(score, w=weight, na.rm=TRUE)), by = cols]), 
    idcol="brand"), country, gender, age_group, brand)
out <- data.table::setcolorder(ot[ot2, on=cols], "brand")
out[, brand := factor(brand)]
})
#>    user  system elapsed 
#>  49.836   3.478  11.543

all.equal(current_loop, out)
#> [1] TRUE

^{由 reprex package (v2.0.1)}

创建于 2022-03-07

Answer 2

这是一个更快的方法

获取总数的 n、平均值、标准差

cases_total = repex_DT[, .(cases_total = sum(weight, na.rm=T),
                           mean_total = expss::w_mean(score, weight, na.rm=T),
                           sd_total = expss::w_sd(score, weight, na.rm=T)), .(country, gender, age_group)]

获取每个品牌的 n、平均值、标准差

cases_brand = repex_DT[, .(cases_brand = sum(weight, na.rm=T),
                           mean_brand = expss::w_mean(score, weight, na.rm=T),
                           sd_brand = expss::w_sd(score, weight, na.rm=T)), .(brand,country, gender, age_group)]

将它们合并在一起

result = cases_brand[cases_total, on=.(country, gender, age_group)][, cases_others:=cases_total-cases_brand]

很容易对“其他人”（即 non-brand）变得刻薄

result[, mean_others:= (cases_total*mean_total - cases_brand*mean_brand)/cases_others]

现在，一个获取“其他”的 sd 的小函数

sd_other <- function(n1,n2,total_sd,sub_sd,total_mean, sub_mean ,other_mean) {
  sqrt(
    (total_sd^2*(n1+n2-1) - (n1-1)*sub_sd^2 - n1*(sub_mean-total_mean)^2 - n2*(other_mean-total_mean)^2)/(n2-1)
  )
}

应用该函数获取其他函数的 sd

result[, sd_others:= sd_other(cases_brand, cases_others,sd_total,sd_brand,mean_total,mean_brand, mean_others)]

删除不必要的列并设置顺序

result[, `:=`(cases_brand=NULL, mean_brand=NULL, sd_brand=NULL, mean_total=NULL, sd_total=NULL)]
setorder(result, country, gender, age_group, brand)

比较：

> microbenchmark::microbenchmark(list=cmds, times=10)
Unit: milliseconds
         expr       min        lq      mean    median        uq       max neval
 current_loop 3684.4233 3700.1134 3775.4322 3729.8387 3855.5353 3938.4605    10
 new_approach  155.9486  158.2265  164.1699  165.9736  167.5279  174.0746    10

R lapply 向量子集的循环性能

R lapply loop performance with vector subsets

performance

loops

r

data.table