Worker process still alive after 0 seconds, killing
I submitted two Dask containers through my scheduler (PBS):
#!/usr/bin/env bash
#PBS -N MyApp
#PBS -q my_queue
#PBS -l select=1:ncpus=1:mem=2GB
#PBS -l walltime=00:30:00
#PBS -m n
/.../bin/python -m distributed.cli.dask_worker tcp://scheduler:53815 --nanny --death-timeout 60
The first worker connects to the scheduler successfully:
distributed.nanny - INFO - Start Nanny at: 'tcp://...:48652'
distributed.worker - INFO - Start worker at: tcp://...:33401
distributed.worker - INFO - Listening to: tcp://...:33401
distributed.worker - INFO - dashboard at: ...:54725
distributed.worker - INFO - Waiting to connect to: tcp://...:48272
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 1.86 GiB
distributed.worker - INFO - Local Directory: /.../
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://...:48272
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.dask_worker - INFO - Exiting on signal 15
distributed.nanny - INFO - Closing Nanny at 'tcp://...:48652'
Terminated
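(Signal 15 is expected here. On Red Hat it is a plain SIGTERM, because I killed the container myself before it finished.)
The problem is with the second worker:
The worker's container runs fine, but the worker never processes any Dask tasks.
The logs are as follows: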
distributed.nanny - INFO - Start Nanny at: 'tcp://...:51682'
distributed.nanny - INFO - Closing Nanny at 'tcp://...:51682'
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
File "/.../site-packages/distributed/nanny.py", line 338, in start
response = await self.instantiate()
File "/.../site-packages/distributed/nanny.py", line 407, in instantiate
result = await asyncio.wait_for(
File "/.../asyncio/tasks.py", line 468, in wait_for
await waiter
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/.../asyncio/tasks.py", line 492, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/.../site-packages/distributed/core.py", line 269, in _
await asyncio.wait_for(self.start(), timeout=timeout)
File "/.../asyncio/tasks.py", line 494, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/.../runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/.../runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/.../site-packages/distributed/cli/dask_worker.py", line 469, in <module>
go()
File "/.../site-packages/distributed/cli/dask_worker.py", line 465, in go
main()
File "/.../site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/.../site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/.../site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/.../site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/.../site-packages/distributed/cli/dask_worker.py", line 451, in main
loop.run_sync(run)
File "/.../site-packages/tornado/ioloop.py", line 530, in run_sync
return future_cell[0].result()
File "/.../site-packages/distributed/cli/dask_worker.py", line 445, in run
await asyncio.gather(*nannies)
File "/.../asyncio/tasks.py", line 691, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/.../site-packages/distributed/core.py", line 273, in _
raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 240 seconds
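As you can see, the second worker never seems to listen. It only does nanny-related things.
Do you have any idea why the second worker never starts listening?
Thanks.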
Edit:
I get the same error with HTCondor:
distributed.nanny - INFO - Start Nanny at: 'tcp://10.5.230.211:22967'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.5.230.211:22967'
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
File "/site-packages/distributed/nanny.py", line 338, in start
response = await self.instantiate()
File "/site-packages/distributed/nanny.py", line 407, in instantiate
result = await asyncio.wait_for(
File "/asyncio/tasks.py", line 466, in wait_for
await waiter
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/asyncio/tasks.py", line 490, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/site-packages/distributed/core.py", line 269, in _
await asyncio.wait_for(self.start(), timeout=timeout)
File "/asyncio/tasks.py", line 492, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/site-packages/distributed/cli/dask_worker.py", line 469, in <module>
go()
File "/site-packages/distributed/cli/dask_worker.py", line 465, in go
main()
File "/site-packages/click/core.py", line 1126, in __call__
return self.main(*args, **kwargs)
File "/site-packages/click/core.py", line 1051, in main
rv = self.invoke(ctx)
File "/site-packages/click/core.py", line 1393, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/site-packages/click/core.py", line 752, in invoke
return __callback(*args, **kwargs)
File "/site-packages/distributed/cli/dask_worker.py", line 451, in main
loop.run_sync(run)
File "/site-packages/tornado/ioloop.py", line 530, in run_sync
return future_cell[0].result()
File "/site-packages/distributed/cli/dask_worker.py", line 445, in run
await asyncio.gather(*nannies)
File "/asyncio/tasks.py", line 688, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/site-packages/distributed/core.py", line 273, in _
raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds
It works when the --no-dashboard option is passed to each dask-worker.
https://github.com/dask/dask-jobqueue/issues/391#issuecomment-639257428
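As a sketch of that fix, here is the PBS job script from the question with --no-dashboard appended to the worker command. The /.../bin/python path (elided in the question) and the scheduler address are copied unchanged, so substitute your own values; the only addition is the final flag.

```shell
#!/usr/bin/env bash
#PBS -N MyApp
#PBS -q my_queue
#PBS -l select=1:ncpus=1:mem=2GB
#PBS -l walltime=00:30:00
#PBS -m n

# Same worker command as in the question; /.../bin/python is the elided
# interpreter path and must be replaced with the real one.
# --no-dashboard stops the worker from starting its dashboard server.
/.../bin/python -m distributed.cli.dask_worker tcp://scheduler:53815 \
    --nanny --death-timeout 60 --no-dashboard
```

With the flag in place, the worker log should no longer contain a "dashboard at:" line. The HTCondor submission would need the same flag added to its worker command.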