Worker process still alive after 0 seconds, killing

I submitted two Dask worker containers through my batch scheduler (PBS):

#!/usr/bin/env bash

#PBS -N MyApp
#PBS -q my_queue
#PBS -l select=1:ncpus=1:mem=2GB
#PBS -l walltime=00:30:00
#PBS -m n

/.../bin/python -m distributed.cli.dask_worker tcp://scheduler:53815 --nanny --death-timeout 60

The first worker connects to the Dask scheduler successfully:

distributed.nanny - INFO -         Start Nanny at: 'tcp://...:48652'
distributed.worker - INFO -       Start worker at:    tcp://...:33401
distributed.worker - INFO -          Listening to:    tcp://...:33401
distributed.worker - INFO -          dashboard at:          ...:54725
distributed.worker - INFO - Waiting to connect to:   tcp://...:48272
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                   1.86 GiB
distributed.worker - INFO -       Local Directory: /.../
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:   tcp://...:48272
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.dask_worker - INFO - Exiting on signal 15
distributed.nanny - INFO - Closing Nanny at 'tcp://...:48652'
Terminated

(Signal 15 is expected. On Red Hat it is a plain SIGTERM, which appears because I killed the container myself before it finished.)

The problem is with the second worker:

The worker's container is fine, but the worker never processes any Dask task.

Here is its log:

distributed.nanny - INFO -         Start Nanny at: 'tcp://...:51682'
distributed.nanny - INFO - Closing Nanny at 'tcp://...:51682'
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/.../site-packages/distributed/nanny.py", line 338, in start
    response = await self.instantiate()
  File "/.../site-packages/distributed/nanny.py", line 407, in instantiate
    result = await asyncio.wait_for(
  File "/.../asyncio/tasks.py", line 468, in wait_for
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/.../asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/.../site-packages/distributed/core.py", line 269, in _
    await asyncio.wait_for(self.start(), timeout=timeout)
  File "/.../asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/.../runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/.../runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/.../site-packages/distributed/cli/dask_worker.py", line 469, in <module>
    go()
  File "/.../site-packages/distributed/cli/dask_worker.py", line 465, in go
    main()
  File "/.../site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/.../site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/.../site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/.../site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/.../site-packages/distributed/cli/dask_worker.py", line 451, in main
    loop.run_sync(run)
  File "/.../site-packages/tornado/ioloop.py", line 530, in run_sync
    return future_cell[0].result()
  File "/.../site-packages/distributed/cli/dask_worker.py", line 445, in run
    await asyncio.gather(*nannies)
  File "/.../asyncio/tasks.py", line 691, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/.../site-packages/distributed/core.py", line 273, in _
    raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 240 seconds

As you can see, the second worker never seems to listen. It only does the nanny-related steps.

Do you have any idea why the second worker never starts?

Thanks

Edit:

I get the same error with HTCondor:

distributed.nanny - INFO -         Start Nanny at: 'tcp://10.5.230.211:22967'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.5.230.211:22967'
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/site-packages/distributed/nanny.py", line 338, in start
    response = await self.instantiate()
  File "/site-packages/distributed/nanny.py", line 407, in instantiate
    result = await asyncio.wait_for(
  File "/asyncio/tasks.py", line 466, in wait_for
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/asyncio/tasks.py", line 490, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/site-packages/distributed/core.py", line 269, in _
    await asyncio.wait_for(self.start(), timeout=timeout)
  File "/asyncio/tasks.py", line 492, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/site-packages/distributed/cli/dask_worker.py", line 469, in <module>
    go()
  File "/site-packages/distributed/cli/dask_worker.py", line 465, in go
    main()
  File "/site-packages/click/core.py", line 1126, in __call__
    return self.main(*args, **kwargs)
  File "/site-packages/click/core.py", line 1051, in main
    rv = self.invoke(ctx)
  File "/site-packages/click/core.py", line 1393, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/site-packages/click/core.py", line 752, in invoke
    return __callback(*args, **kwargs)
  File "/site-packages/distributed/cli/dask_worker.py", line 451, in main
    loop.run_sync(run)
  File "/site-packages/tornado/ioloop.py", line 530, in run_sync
    return future_cell[0].result()
  File "/site-packages/distributed/cli/dask_worker.py", line 445, in run
    await asyncio.gather(*nannies)
  File "/asyncio/tasks.py", line 688, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/site-packages/distributed/core.py", line 273, in _
    raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds

It works when the --no-dashboard option is passed to every dask-worker:

https://github.com/dask/dask-jobqueue/issues/391#issuecomment-639257428
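
For reference, here is a sketch of the PBS job script from the question with the flag appended. The `/.../bin/python` path and the scheduler address are left exactly as they were elided in the original script, so substitute your own values. The only change is the trailing `--no-dashboard`.

```shell
#!/usr/bin/env bash

#PBS -N MyApp
#PBS -q my_queue
#PBS -l select=1:ncpus=1:mem=2GB
#PBS -l walltime=00:30:00
#PBS -m n

# Same worker command as before, plus --no-dashboard so the worker
# does not start its dashboard HTTP server.
/.../bin/python -m distributed.cli.dask_worker tcp://scheduler:53815 --nanny --death-timeout 60 --no-dashboard
```

With this change the worker log should no longer contain a "dashboard at:" line, which is a quick way to confirm that the flag was picked up.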