dask: How do I avoid timeout for a task?
In my dask-based application (using the distributed scheduler), I'm seeing failures that start with this error text:
tornado.application - ERROR - Exception in Future <Future cancelled> after timeout
Traceback (most recent call last):
File "/miniconda/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 970, in error_callback
future.result()
concurrent.futures._base.CancelledError
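It is followed by a second traceback which (I think) indicates which line my task was running when the timeout occurred. (I'm not clear on how distributed manages to do that. Via a signal, maybe?)
Here is an abridged portion of the second traceback: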
... my code...
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 156, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 397, in compute
results = schedule(dsk, keys, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 2308, in get
direct=direct)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1647, in gather
asynchronous=asynchronous)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 665, in sync
return sync(self.loop, func, *args, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 277, in sync
six.reraise(*error[0])
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 262, in f
result[0] = yield future
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1492, in _gather
traceback)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 692, in reraise
raise value.with_traceback(tb)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1562, in reify
seq = list(seq)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1722, in map_chunk
yield f(*a)
... my code ....
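Does "after timeout" indicate that the task is taking too long, or is some other timeout triggering the cancellation, such as a nanny or heartbeat timeout? (From what I can tell, there is no explicit timeout on the length of a task in dask, but maybe I'm confused.)
I can see that the task was cancelled, but I would like to know why. Is there an easy way to figure out which line of code (in dask or distributed) is cancelling my task, and why?
I expect these tasks to take a long time: they are uploading large buffers to cloud storage. How can I increase the timeout for a particular task in dask?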
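Dask does not impose a timeout on tasks by default.
The cancelled future that you're seeing isn't a Dask future, it's a Tornado future (Tornado is the library that Dask uses for network communication). So unfortunately, all this message tells you is that something failed.
The traceback that follows should contain more specific information about the code that failed. Ideally it points to the line in your function where the failure occurred. Maybe that helps?
In general, we recommend these steps when debugging code run through Dask: http://docs.dask.org/en/latest/debugging.html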
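Since the second traceback is the place to look, one way to make it more informative is to wrap the task function so that every failure is logged together with how long the task had been running. The following is only a minimal sketch using the standard library, not part of Dask's API: the names `log_task_errors` and `upload_buffer` are hypothetical, and `upload_buffer` is a stand-in for the real upload code.

```python
import functools
import logging
import time
import traceback

logger = logging.getLogger("task_debug")


def log_task_errors(func):
    """Wrap a task function so failures are logged with elapsed time and traceback."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        except Exception:
            elapsed = time.monotonic() - start
            logger.error(
                "%s failed after %.1f s\n%s",
                func.__name__,
                elapsed,
                traceback.format_exc(),
            )
            raise  # re-raise so the caller (e.g. the scheduler) still sees the failure

    return wrapper


@log_task_errors
def upload_buffer(buf):
    # Hypothetical stand-in for the real upload to cloud storage.
    if not buf:
        raise ValueError("empty buffer")
    return len(buf)


# With a dask bag, the wrapped function would be passed as usual, e.g.:
#     results = bag.map(upload_buffer).compute()
```

If the logged elapsed time is close to a round number such as 30 or 60 seconds, that suggests a timeout inside the code the task calls (for example, a storage client), rather than anything in Dask itself.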