在 EC2 实例中使用 Dask 抛出 "Couldn't gather 1 keys... "
Using Dask in EC2 instances throws "Couldn't gather 1 keys... "
我启动了几个 EC2 实例,安装了带有 conda 的 dask,在各自的实例中启动了调度程序和工作程序,并且调度程序能够接收来自工作程序的连接。但是,在启动客户端并收集结果(例如 x.result()
)后会抛出错误
WARNING - Couldn't gather 1 keys, rescheduling and the connection between scheduler and worker is terminated.
这与本期 2095 and fixed in 1278 中的错误几乎相同。不幸的是,很清楚如何使用新标志解决问题。
这是我的会话的样子:
调度程序 - 终端
>>> from dask.distributed import Client
>>> client = Client('<domain-scheduler>:8786')
>>> def inc(x):
... return x + 1
...
>>> x = client.submit(inc, 10)
>>> x.result()
distributed.client - WARNING - Couldn't gather 1 keys, rescheduling {'inc-17ff1aa09aeed9c364fc31df7522511e': ('tcp://172.30.3.63:38971',)}
^CTraceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/anaconda2/envs/dask-env/lib/python2.7/site-packages/distributed/client.py", line 190, in result
raiseit=False)
File "/home/ubuntu/anaconda2/envs/dask-env/lib/python2.7/site-packages/distributed/client.py", line 652, in sync
return sync(self.loop, func, *args, **kwargs)
File "/home/ubuntu/anaconda2/envs/dask-env/lib/python2.7/site-packages/distributed/utils.py", line 273, in sync
e.wait(10)
File "/home/ubuntu/anaconda2/envs/dask-env/lib/python2.7/threading.py", line 614, in wait
self.__cond.wait(timeout)
File "/home/ubuntu/anaconda2/envs/dask-env/lib/python2.7/threading.py", line 359, in wait
_sleep(delay)
KeyboardInterrupt
调度器 - dask-scheduler
(dask-env) ubuntu@ip-172-30-3-136:~$ dask-scheduler --host <domain-scheduler>:8786 --bokeh-port 8080
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://172.30.3.136:8786
distributed.scheduler - INFO - bokeh at: 172.30.3.136:8080
distributed.scheduler - INFO - Local Directory: /tmp/scheduler-TX9nqO
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Register tcp://172.30.3.63:38971
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.30.3.63:38971
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-b5d903b5-8620-11e8-8a4c-06a866fbd474
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Remove worker tcp://172.30.3.63:38971
distributed.core - INFO - Removing comms to tcp://172.30.3.63:38971
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://172.30.3.63:38971'], inc-17ff1aa09aeed9c364fc31df7522511e
None
^Cdistributed.scheduler - INFO - End scheduler at u'tcp://<domain>:8786'
Worker - dask-worker
(dask-env) ubuntu@ip-172-30-3-63:~$ dask-worker --host <domain-worker>:8786 <domain-scheduler>:8786
distributed.nanny - INFO - Start Nanny at: 'tcp://172.30.3.63:8786'
distributed.worker - INFO - Start worker at: tcp://172.30.3.63:38971
distributed.worker - INFO - Listening to: tcp://172.30.3.63:38971
distributed.worker - INFO - bokeh at: 172.30.3.63:8789
distributed.worker - INFO - nanny at: 172.30.3.63:8786
distributed.worker - INFO - Waiting to connect to: tcp://<domain-schedule>:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 1.04 GB
distributed.worker - INFO - Local Directory: /home/ubuntu/dask-worker-space/worker-EnKL22
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://<domain-scheduler>:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Stopping worker at tcp://172.30.3.63:38971
distributed.worker - WARNING - Heartbeat to scheduler failed
distributed.nanny - INFO - Closing Nanny at 'tcp://172.30.3.63:8786'
distributed.dask_worker - INFO - End worker
如您所见,会话在 运行 x.result()
后终止。我还尝试包含 --listen-address
、--contact-address
但没有成功。
我过去遇到过这个问题是因为调度程序无法联系到工作人员。如果您从调度程序 运行 curl <domain-worker>:8789
返回散景 html?我猜这不是您需要更改 AWS 中的网络设置。
解决方案是为 dask-scheduler
和 dask-worker
提供特定的开放端口,而不是允许它们 select 其他随机端口。命令应如下所示:
调度程序
dask-scheduler --host <domain-scheduler> --port 8786 --bokeh-port <open-port>
工人
dask-worker --host <domain-worker> <domain-scheduler>:8786 --worker-port 8786
航站楼
client = Client('tcp://<domain-scheduler>:8786')
我启动了几个 EC2 实例,安装了带有 conda 的 dask,在各自的实例中启动了调度程序和工作程序,并且调度程序能够接收来自工作程序的连接。但是,在启动客户端并收集结果(例如 x.result()
)后会抛出错误
WARNING - Couldn't gather 1 keys, rescheduling and the connection between scheduler and worker is terminated.
这与本期 2095 and fixed in 1278 中的错误几乎相同。不幸的是,很清楚如何使用新标志解决问题。
这是我的会话的样子:
调度程序 - 终端
>>> from dask.distributed import Client
>>> client = Client('<domain-scheduler>:8786')
>>> def inc(x):
... return x + 1
...
>>> x = client.submit(inc, 10)
>>> x.result()
distributed.client - WARNING - Couldn't gather 1 keys, rescheduling {'inc-17ff1aa09aeed9c364fc31df7522511e': ('tcp://172.30.3.63:38971',)}
^CTraceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/anaconda2/envs/dask-env/lib/python2.7/site-packages/distributed/client.py", line 190, in result
raiseit=False)
File "/home/ubuntu/anaconda2/envs/dask-env/lib/python2.7/site-packages/distributed/client.py", line 652, in sync
return sync(self.loop, func, *args, **kwargs)
File "/home/ubuntu/anaconda2/envs/dask-env/lib/python2.7/site-packages/distributed/utils.py", line 273, in sync
e.wait(10)
File "/home/ubuntu/anaconda2/envs/dask-env/lib/python2.7/threading.py", line 614, in wait
self.__cond.wait(timeout)
File "/home/ubuntu/anaconda2/envs/dask-env/lib/python2.7/threading.py", line 359, in wait
_sleep(delay)
KeyboardInterrupt
调度器 - dask-scheduler
(dask-env) ubuntu@ip-172-30-3-136:~$ dask-scheduler --host <domain-scheduler>:8786 --bokeh-port 8080
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://172.30.3.136:8786
distributed.scheduler - INFO - bokeh at: 172.30.3.136:8080
distributed.scheduler - INFO - Local Directory: /tmp/scheduler-TX9nqO
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Register tcp://172.30.3.63:38971
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.30.3.63:38971
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-b5d903b5-8620-11e8-8a4c-06a866fbd474
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Remove worker tcp://172.30.3.63:38971
distributed.core - INFO - Removing comms to tcp://172.30.3.63:38971
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://172.30.3.63:38971'], inc-17ff1aa09aeed9c364fc31df7522511e
None
^Cdistributed.scheduler - INFO - End scheduler at u'tcp://<domain>:8786'
Worker - dask-worker
(dask-env) ubuntu@ip-172-30-3-63:~$ dask-worker --host <domain-worker>:8786 <domain-scheduler>:8786
distributed.nanny - INFO - Start Nanny at: 'tcp://172.30.3.63:8786'
distributed.worker - INFO - Start worker at: tcp://172.30.3.63:38971
distributed.worker - INFO - Listening to: tcp://172.30.3.63:38971
distributed.worker - INFO - bokeh at: 172.30.3.63:8789
distributed.worker - INFO - nanny at: 172.30.3.63:8786
distributed.worker - INFO - Waiting to connect to: tcp://<domain-schedule>:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 1.04 GB
distributed.worker - INFO - Local Directory: /home/ubuntu/dask-worker-space/worker-EnKL22
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://<domain-scheduler>:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Stopping worker at tcp://172.30.3.63:38971
distributed.worker - WARNING - Heartbeat to scheduler failed
distributed.nanny - INFO - Closing Nanny at 'tcp://172.30.3.63:8786'
distributed.dask_worker - INFO - End worker
如您所见,会话在 运行 x.result()
后终止。我还尝试包含 --listen-address
、--contact-address
但没有成功。
我过去遇到过这个问题是因为调度程序无法联系到工作人员。如果您从调度程序 运行 curl <domain-worker>:8789
返回散景 html?我猜这不是您需要更改 AWS 中的网络设置。
解决方案是为 dask-scheduler
和 dask-worker
提供特定的开放端口,而不是允许它们 select 其他随机端口。命令应如下所示:
调度程序
dask-scheduler --host <domain-scheduler> --port 8786 --bokeh-port <open-port>
工人
dask-worker --host <domain-worker> <domain-scheduler>:8786 --worker-port 8786
航站楼
client = Client('tcp://<domain-scheduler>:8786')