Is `Map()` parallel when used in a `data.table`? - R

Question

The data.table package website states:

"many common operations are internally parallelized to use multiple CPU threads"

I would like to know whether this also applies when Map() is used inside a data.table?

I ask because, when running the same operation on a fairly large dataset (cor.test(x, y), with x = .SD and y a single column of the dataset), I noticed that Map() executes faster than furrr::future_map2().
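
For readers who want to reproduce the setup, the following is a minimal sketch of the kind of call described above. The data, the column names (`y`, `x1`, `x2`, `x3`) and the choice to extract the p-value are assumptions made for illustration; the original dataset and code were not posted.

```r
library(data.table)

set.seed(42)
# Hypothetical data: `y` is the single reference column, x1..x3 play the role of .SD
dt <- data.table(
  y  = rnorm(1000),
  x1 = rnorm(1000),
  x2 = rnorm(1000),
  x3 = rnorm(1000)
)
x_cols <- c("x1", "x2", "x3")

# Map() visits the .SD columns one at a time. The reference column is wrapped
# in list() so it is recycled as a whole vector for every .SD column.
res <- dt[, Map(function(x, ref) cor.test(x, ref)$p.value, .SD, list(y)),
          .SDcols = x_cols]
res
```

The result is a one-row data.table with one p-value per column listed in `x_cols`.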

Answer 1

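You can use this rather exploratory approach and check whether the elapsed time drops when more threads are used. Note that on my machine the maximum number of available threads is only one, so no difference is possible here.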

library(data.table)

dt <- data.table::data.table(a = 1:3,
                             b = 4:6)
dt
#>    a b
#> 1: 1 4
#> 2: 2 5
#> 3: 3 6

data.table::getDTthreads()
#> [1] 1

# No parallelisation ---------------------------------
data.table::setDTthreads(1)
system.time({
  
  dt[, lapply(.SD,
              function(x) {
                Sys.sleep(2)
                x}
  )
  ]
})
#>    user  system elapsed 
#>   0.009   0.001   4.017

# Parallel -------------------------------------------
# request two threads (getDTthreads() still reports 1 on this single-thread machine)
data.table::setDTthreads(2)
data.table::getDTthreads()
#> [1] 1

# if parallel, elapsed should be below 4
system.time({
  
  dt[, lapply(.SD,
              function(x) {
                Sys.sleep(2)
                x}
  )
  ]
})
#>    user  system elapsed 
#>   0.001   0.000   4.007

# Map -----------------------------------------------
# if parallel, elapsed should be below 4
system.time({
  
  dt[, Map(f = function(x, y) {
    Sys.sleep(2)
    x},
    .SD,
    1:2
    
  )
  ]
})
#>    user  system elapsed 
#>   0.002   0.000   4.005
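
On a one-thread machine the timings above cannot settle the question by themselves. What can be said with confidence is that `Map()` is a base R wrapper around `mapply()`, and that data.table's multithreading lives in its own compiled routines, not in arbitrary R functions called from `j`. If parallel execution is wanted, it has to be requested explicitly. The sketch below does that with the `parallel` package that ships with R; the data, the column names and the helper `cor_p()` are hypothetical and only serve as an illustration.

```r
library(data.table)
library(parallel)

set.seed(42)
# Hypothetical data, for illustration only
dt <- data.table(y = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))
x_cols <- c("x1", "x2")

# Hypothetical helper: p-value of cor.test() between one column and a reference
cor_p <- function(x, ref) cor.test(x, ref)$p.value

# Sequential version: plain Map() inside j
seq_res <- dt[, Map(cor_p, .SD, list(y)), .SDcols = x_cols]

# Explicitly parallel version: two worker processes, one column per task
cl <- makeCluster(2)
par_res <- parLapply(cl, as.list(dt[, ..x_cols]), cor_p, ref = dt$y)
stopCluster(cl)

# Both versions should agree on the computed p-values
all.equal(unlist(seq_res), unlist(par_res))
```

Starting worker processes and copying the columns to them has a cost. That overhead is one plausible reason why a sequential `Map()` can finish sooner than `furrr::future_map2()` when each individual task is cheap, although this was not measured here.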
