Select all rows that contain a specific keyword in R, using parallel processing

Question

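My data (df) contains about 2,0000K rows and about 5K unique names. For each unique name, I want to select all rows of df that contain that name. For example, the data frame df looks like this: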

id  names
1   A,B,D
2   A,B
3   A
4   B,D
5   C,E
6   C,D,E
7   A,E

I want to select all rows whose 'names' column contains 'A' (A is one of the 5K unique names). So the output would be:

id  names
1   A,B,D
2   A,B
3   A
7   A,E

I am trying to do this with parallel processing via mclapply, using 20 nodes and 80 GB of memory. I still run out of memory.

Here is my code to select the rows containing a specific name:

subset_select = function(x,df){
  indx <- which(
    rowSums(
      `dim<-`(grepl(x, as.matrix(df), fixed=TRUE), dim(df))
    ) > 0
  )
  new_df = df[indx, ]
  return(new_df)
}

df_subset = subset_select(name,df)

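My question is: is there a more efficient way (in terms of run time and memory consumption) to get the subset of the data for each of the 5K unique names? TIA.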

Answer 1

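Here is a way to parallelize this with package parallel.

First, the data set has 2M rows. The code below only serves to show that, nothing more. See the comment line after scan.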

x <- scan(file = "~/tmp/temp.txt")
#Read 2000000 items
df1 <- data.frame(id = seq_along(x), names = x)

Now the code.
The parallelized mclapply loop splits the data into chunks of N rows and processes them independently. The return value inx2 must then be unlisted.

library(parallel)

ncores <- detectCores() - 1L
pat <- "A"

t1 <- system.time({
  inx1 <- grep(pat, df1$names)
})

t2 <- system.time({
  N <- 10000L
  iters <- seq_len(ceiling(nrow(df1) / N))
  inx2 <- mclapply(iters, function(k){
    i <- seq_len(N) + (k - 1L)*N
    j <- grep(pat, df1[i, "names"])
    i[j]
  }, mc.cores = ncores)
  inx2 <- unlist(inx2)
})

identical(df1[inx1, ], df1[inx2, ])  
#[1] TRUE

rbind(t1, t2)
#   user.self sys.self elapsed user.child sys.child
#t1     5.325    0.001   5.371      0.000     0.000
#t2     0.054    0.093   2.446      3.688     0.074

In elapsed time, mclapply took less than half as long as the direct grep (2.446 s versus 5.371 s).
R version 4.1.1 (2021-08-10) on Ubuntu 20.04.3 LTS.
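
The chunking idea can be tried on a small data set without the input file. The sketch below is only an illustration under assumptions: the toy data frame and its values are made up, the chunk size is tiny, the last chunk is clamped to the number of rows, and fixed = TRUE is used because the pattern is a literal string.

```r
library(parallel)

# Hypothetical toy data standing in for df1; the values are made up.
set.seed(1)
toy <- data.frame(
  id = 1:25,
  names = sample(c("A,B,D", "A,B", "A", "B,D", "C,E", "C,D,E", "A,E"),
                 25, replace = TRUE),
  stringsAsFactors = FALSE
)

pat <- "A"
N <- 10L  # chunk size; 25 rows give chunks of 10, 10 and 5 rows

# mclapply relies on forking, which is not available on Windows,
# so fall back to a single core there.
ncores <- if (.Platform$OS.type == "windows") 1L else 2L

starts <- seq(1L, nrow(toy), by = N)
inx_chunked <- unlist(mclapply(starts, function(s) {
  i <- s:min(s + N - 1L, nrow(toy))          # clamp the last chunk
  i[grep(pat, toy$names[i], fixed = TRUE)]   # map back to global row numbers
}, mc.cores = ncores))

# The chunked result should agree with a single direct grep.
inx_direct <- grep(pat, toy$names, fixed = TRUE)
identical(inx_chunked, inx_direct)
```

Clamping the last chunk only matters when the number of rows is not a multiple of N; with 2M rows and N = 10000 the code above already produces full chunks.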

Answer 2

If you need to repeat this for multiple "names", then base::by() may help to pre-group the data, e.g.

data <- read.table(header=TRUE, text=
"id  names
1   A,B,D
2   A,B
3   A
4   B,D
5   C,E
6   C,D,E
7   A,E
8   A
9   A,B"
)

groups <- by(data, INDICES = data$names, FUN = function(x) x$id)
print(groups)
#> data$names: A
#> [1] 3 8
#> ------------------------------------------------------------ 
#> data$names: A,B
#> [1] 2 9
#> ------------------------------------------------------------ 
#> data$names: A,B,D
#> [1] 1
#> ------------------------------------------------------------ 
#> data$names: A,E
#> [1] 7
#> ------------------------------------------------------------ 
#> data$names: B,D
#> [1] 4
#> ------------------------------------------------------------ 
#> data$names: C,D,E
#> [1] 6
#> ------------------------------------------------------------ 
#> data$names: C,E
#> [1] 5

print(groups$A)
#> [1] 3 8

Then all groups that contain A, and their ids, can be found with:

name <- "A"
groups_subset <- groups[grep(name, names(groups))]
idxs <- sort(unlist(groups_subset, use.names = FALSE))
data_subset <- data[idxs, ]
rownames(data_subset) <- NULL  ## optional
print(data_subset)
#>   id names
#> 1  1 A,B,D
#> 2  2   A,B
#> 3  3     A
#> 4  7   A,E
#> 5  8     A
#> 6  9   A,B
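
One caveat: grep(name, names(groups)) matches substrings. That is harmless with single-letter names, but if the real names are longer, a name such as "A" would also match a group key containing "AB". A minimal sketch of whole-name matching follows, assuming the keys are comma-separated with no surrounding spaces; the keys and the helper has_name() are hypothetical and only for illustration.

```r
# Hypothetical group keys, as names(groups) might look with longer names.
keys <- c("A", "A,B", "AB,C", "B,D", "C,A")

# has_name() is an illustrative helper, not part of base R.
# It splits each key on commas and checks for an exact token match.
has_name <- function(keys, name) {
  tokens <- strsplit(keys, ",", fixed = TRUE)
  vapply(tokens, function(tk) name %in% tk, logical(1))
}

keys[grepl("A", keys, fixed = TRUE)]  # substring match also picks up "AB,C"
keys[has_name(keys, "A")]             # whole-name match: "A", "A,B", "C,A"
```

Under these assumptions, groups[has_name(names(groups), name)] could stand in for the grep-based subsetting.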

Does that look right to you? If so, you could see whether future.apply (disclaimer: I'm the author) and its future_by() help you run this in parallel:

library(future.apply)

## Run in parallel using forked ("multicore") processing
## All cores by default, otherwise add 'workers = 20'
plan(multicore)

data <- ... as above ...

groups <- future_by(data, INDICES = data$names, FUN = function(x) x$id)

name <- "A"
groups_subset <- groups[grep(name, names(groups))]
idxs <- sort(unlist(groups_subset, use.names = FALSE))
data_subset <- data[idxs, ]
rownames(data_subset) <- NULL  ## optional
print(data_subset)
#>   id names
#> 1  1 A,B,D
#> 2  2   A,B
#> 3  3     A
#> 4  7   A,E
#> 5  8     A
#> 6  9   A,B
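
A related way to pre-group, which is not the approach of this answer but follows the same idea, is to build one lookup from each individual name to the rows that contain it. Each of the 5K names then becomes a single list lookup instead of a new search. The sketch below assumes comma-separated names with no spaces and no name repeated within a row; the variable names are made up for illustration. The lookup holds one integer per (row, name) pair, so whether it fits in memory for the full data would need to be checked.

```r
data <- data.frame(
  id = 1:7,
  names = c("A,B,D", "A,B", "A", "B,D", "C,E", "C,D,E", "A,E"),
  stringsAsFactors = FALSE
)

# Split every row into its individual names, once.
tokens <- strsplit(data$names, ",", fixed = TRUE)

# For each token, remember which row it came from.
row_of_token <- rep(seq_along(tokens), lengths(tokens))

# Group the row numbers by name: one integer vector per unique name.
index <- split(row_of_token, unlist(tokens))

index[["A"]]           # rows containing "A": 1 2 3 7
data[index[["A"]], ]   # the corresponding subset of the data
```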
