mclapply encounters errors depending on core id?

Question

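I have a set of genes for which I need to compute some coefficients in parallel. The coefficients are computed inside GeneTo_GeneCoeffs_filtered, which takes a gene name as input and returns a list of 2 data frames.

With a gene_array of length 100, I ran this command with different numbers of cores: 5, 6 and 7.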

Coeffslist=mclapply(gene_array,GeneTo_GeneCoeffs_filtered,mc.cores = no_cores)

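Depending on the number of cores assigned to mclapply, I get errors for different gene names.

The indices of the genes for which GeneTo_GeneCoeffs_filtered fails to return a list of data frames follow a pattern. With 7 cores assigned to mclapply, they are elements 4, 11, 18, 25, ..., 95 of gene_array (every 7th); when R uses 6 cores, the indices are 2, 8, 14, ..., 98 (every 6th); and likewise with 5 cores, every 5th.

Most importantly, the failing indices differ between these runs, which means the problem does not lie with specific genes.

I suspect there may be a "broken" core that cannot run my function properly, and that only this core produces the error. Is there a way to trace its id and exclude it from the list of cores R can use?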

Answer 1

A careful read of the mclapply man page shows that this behaviour is by design. It results from the interaction of the following two points:

(a)

"the input X is split into as many parts as there are cores (currently the values are spread across the cores sequentially, i.e. first value to core 1, second to core 2, ... (core + 1)-th value to core 1 etc.) and then one process is forked to each core and the results are collected."

(b)

a "try-error" object will be returned for all the values involved in the failure, even if not all of them failed.

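In your case, because of (a), your gene_array is spread across the cores "round-robin" style (the indices of successive elements on the same core are mc.cores apart), and because of (b), if any gene_array element triggers an error, you get an error back for every gene_array element that was sent to that core (again with a gap of mc.cores between the indices of those elements).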
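The interaction of (a) and (b) can be reproduced with a small sketch. Here toy_fun is a hypothetical stand-in for GeneTo_GeneCoeffs_filtered that fails for a single input only, and the code assumes a Unix-alike system, since mclapply relies on forking:

```r
library(parallel)

# Hypothetical stand-in for GeneTo_GeneCoeffs_filtered:
# it fails for exactly one input (5) and succeeds for all others.
toy_fun <- function(i) {
  if (i == 5) stop("failure for element 5")
  i^2
}

# Default prescheduling: with 3 cores, elements 2, 5, 8 and 11 form one job.
# mclapply warns that a core encountered errors; the warning is silenced here.
res <- suppressWarnings(mclapply(1:12, toy_fun, mc.cores = 3))

# Indices whose result is a "try-error" object.
failed <- which(vapply(res, inherits, logical(1), what = "try-error"))
print(failed)
```

Only element 5 actually fails, yet every element that shared a job with it (2, 5, 8 and 11) should come back as a "try-error", which is the same "every mc.cores-th index" pattern described in the question.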

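I refreshed my understanding of this in an exchange with Simon Urbanek yesterday: https://stat.ethz.ch/pipermail/r-sig-hpc/2019-September/002098.html where I also provide an error-handling approach that yields errors only for the indices that actually produce them.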
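One common way to confine errors to the indices that raise them is to catch the error inside the worker function, so that the job as a whole never fails. The sketch below is an illustration of that general idea and not necessarily the approach from the linked post; toy_fun and safe_fun are hypothetical names, and a Unix-alike system is assumed:

```r
library(parallel)

# Hypothetical stand-in for GeneTo_GeneCoeffs_filtered.
toy_fun <- function(i) {
  if (i == 5) stop("failure for element 5")
  i^2
}

# Hypothetical wrapper: turn an error into a returned condition object,
# so one failing element does not poison the other elements of its job.
safe_fun <- function(i) tryCatch(toy_fun(i), error = function(e) e)

res <- mclapply(1:12, safe_fun, mc.cores = 3)

# Only the element that really failed carries an error condition.
failed <- which(vapply(res, inherits, logical(1), what = "error"))
print(failed)
```

The returned condition keeps the original message, so conditionMessage(res[[5]]) can be used to inspect what went wrong for that gene.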

You can also get errors only for the indices that generate them by passing mc.preschedule=FALSE.
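A minimal sketch of the mc.preschedule=FALSE variant, again using the hypothetical toy_fun in place of GeneTo_GeneCoeffs_filtered and assuming a Unix-alike system. Without prescheduling, each element is evaluated in its own forked job, so a failure affects that element alone:

```r
library(parallel)

# Hypothetical stand-in for GeneTo_GeneCoeffs_filtered.
toy_fun <- function(i) {
  if (i == 5) stop("failure for element 5")
  i^2
}

# One fork per element; mclapply still warns about the failed call,
# which is silenced here.
res <- suppressWarnings(
  mclapply(1:12, toy_fun, mc.cores = 3, mc.preschedule = FALSE)
)

failed <- which(vapply(res, inherits, logical(1), what = "try-error"))
print(failed)
```

The trade-off is one fork per element instead of one per core, which adds overhead when individual calls are short.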

