Elastic parallelism and fault-tolerance in distributed Julia

Question

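How does Julia provide fault tolerance when a node goes down (intentionally or unintentionally), or when communication between nodes fails?

I have seen a few mentions of such functionality, but I could not find how it is actually implemented.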

Answer 1

In the pmap docstring (quoted here from Julia v0.5) you can see that this is already implemented through the retry_* keyword arguments.

pmap([::AbstractWorkerPool], f, c...; distributed=true, batch_size=1,
on_error=nothing, retry_n=0, retry_max_delay=DEFAULT_RETRY_MAX_DELAY,
retry_on=DEFAULT_RETRY_ON) -> collection

... Any error stops pmap from processing the remainder of the collection. To override this behavior you can specify an error handling function via argument on_error which takes in a single argument, i.e., the exception. The function can stop the processing by rethrowing the error, or, to continue, return any value which is then returned inline with the results to the caller.

Failed computation can also be retried via retry_on, retry_n, retry_max_delay, which are passed through to retry as arguments retry_on, n and max_delay respectively. If batching is specified, and an entire batch fails, all items in the batch are retried.
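As a rough illustration of the two mechanisms the docstring describes, here is a minimal sketch. It assumes a newer Julia (1.x), where pmap lives in the Distributed standard library and the retry keywords are retry_delays and retry_check instead of the retry_n, retry_max_delay and retry_on shown above. The names square_or_fail, fail_first_call and ATTEMPTS are made up for the example.

```julia
using Distributed

# Hypothetical task that always fails for one particular input.
@everywhere square_or_fail(x) = x == 3 ? error("bad input $x") : x^2

# on_error receives the exception. Returning a value (instead of rethrowing)
# lets pmap continue and places that value inline with the results.
handled = pmap(square_or_fail, 1:5; on_error = e -> missing)

# Hypothetical task that fails only on its first call in each process,
# standing in for a transient fault.
@everywhere const ATTEMPTS = Ref(0)
@everywhere function fail_first_call(x)
    ATTEMPTS[] += 1
    ATTEMPTS[] == 1 && error("transient failure")
    return 2x
end

# retry_delays gives the pauses between attempts; with no retry_check,
# any exception triggers a retry until the delays run out.
retried = pmap(fail_first_call, 1:3; retry_delays = ExponentialBackOff(n = 3))
```

If no workers have been added with addprocs, both calls simply run on the master process, which is enough to try the keywords out. Note that on_error is applied before the retry logic, so an error swallowed by on_error is not retried; that is why the sketch uses them in separate calls.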

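I don't think the @parallel macro has anything like this. However, you can use the Base.wrap_on_error and Base.wrap_retry functions to extend your original function so that it handles errors. You can see many of the implementation details by looking at the definition of pmap at https://github.com/JuliaLang/julia/blob/v0.5.0/base/pmap.jl.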
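The idea of wrapping the function yourself can be sketched as follows. This is only an assumption-laden sketch for a newer Julia (1.x), where @parallel has been renamed @distributed; it uses the public retry function rather than the internal wrap_* helpers mentioned above. The names unreliable_square, safe_square and CALLS are hypothetical.

```julia
using Distributed

# Hypothetical function that fails on every third call in a given process,
# standing in for a transient error.
@everywhere const CALLS = Ref(0)
@everywhere function unreliable_square(x)
    CALLS[] += 1
    CALLS[] % 3 == 0 && error("transient failure on call $(CALLS[])")
    return x^2
end

# Wrap the function so that a failed call is repeated, here up to three
# more times with a short fixed pause between attempts.
@everywhere safe_square = retry(unreliable_square; delays = fill(0.01, 3))

# The loop body calls the wrapped function, so a transient error no longer
# aborts the whole reduction.
total = @distributed (+) for i in 1:10
    safe_square(i)
end
```

This only repeats the call inside the process that is already running the loop chunk. It does not move work to another worker if the process itself dies, which is something pmap can do because it picks a worker from the pool for each call.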

The basic strategy is simply to catch the error (and possibly the data) and retry on the same worker, or on a different worker if that one has died. I think.


parallel-processing

fault-tolerance

julia