Dask distributed.scheduler - ERROR - Couldn't gather keys
import joblib
from dask_ml.model_selection import GridSearchCV
from xgboost import XGBRegressor

# param_grid, df2 and df3 are defined elsewhere
grid_search = GridSearchCV(estimator=XGBRegressor(), param_grid=param_grid, cv=3, n_jobs=-1)

with joblib.parallel_backend('dask'):
    grid_search.fit(df2, df3)
I created a Dask cluster from two local machines:
client = dask.distributed.Client('tcp://191.xxx.xx.xxx:8786')
I am trying to find the best hyperparameters with dask-ml's GridSearchCV, but I keep hitting the errors below:
distributed.scheduler - ERROR - Couldn't gather keys {"('xgbregressor-fit-score-7cb7087b3aff75a31f487cfe5a9cedb0', 1202, 2)": ['tcp://127.0.0.1:3738']} state: ['processing'] workers: ['tcp://127.0.0.1:3738']
NoneType: None
distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:3738'], ('xgbregressor-fit-score-7cb7087b3aff75a31f487cfe5a9cedb0', 1202, 2)
NoneType: None
distributed.client - WARNING - Couldn't gather 1 keys, rescheduling {"('xgbregressor-fit-score-7cb7087b3aff75a31f487cfe5a9cedb0', 1202, 2)": ('tcp://127.0.0.1:3738',)}
distributed.nanny - WARNING - Restarting worker
distributed.scheduler - ERROR - Couldn't gather keys {"('xgbregressor-fit-score-7cb7087b3aff75a31f487cfe5a9cedb0', 1, 2)": ['tcp://127.0.0.1:3730']} state: ['processing'] workers: ['tcp://127.0.0.1:3730']
NoneType: None
distributed.scheduler - ERROR - Couldn't gather keys {"('xgbregressor-fit-score-7cb7087b3aff75a31f487cfe5a9cedb0', 0, 1)": ['tcp://127.0.0.1:3730'], "('xgbregressor-fit-score-7cb7087b3aff75a31f487cfe5a9cedb0', 5, 1)": ['tcp://127.0.0.1:3729'], "('xgbregressor-fit-score-7cb7087b3aff75a31f487cfe5a9cedb0', 4, 2)": ['tcp://127.0.0.1:3729'], "('xgbregressor-fit-score-7cb7087b3aff75a31f487cfe5a9cedb0', 2, 1)": ['tcp://127.0.0.1:3730']} state: ['processing', 'processing', 'processing', 'processing'] workers: ['tcp://127.0.0.1:3730', 'tcp://127.0.0.1:3729']
NoneType: None
distributed.scheduler - ERROR - Couldn't gather keys {'cv-n-samples-7cb7087b3aff75a31f487cfe5a9cedb0': ['tcp://127.0.0.1:3729']} state: ['processing'] workers: ['tcp://127.0.0.1:3729']
NoneType: None
distributed.scheduler - ERROR - Couldn't gather keys {"('xgbregressor-fit-score-7cb7087b3aff75a31f487cfe5a9cedb0', 4, 0)": ['tcp://127.0.0.1:3729'], "('xgbregressor-fit-score-7cb7087b3aff75a31f487cfe5a9cedb0', 2, 0)": ['tcp://127.0.0.1:3729'], "('xgbregressor-fit-score-7cb7087b3aff75a31f487cfe5a9cedb0', 0, 0)": ['tcp://127.0.0.1:3729']} state: ['processing', 'processing', 'processing'] workers: ['tcp://127.0.0.1:3729']
NoneType: None
distributed.scheduler - ERROR - Couldn't gather keys {"('xgbregressor-fit-score-7cb7087b3aff75a31f487cfe5a9cedb0', 0, 2)": ['tcp://127.0.0.1:3729'], "('xgbregressor-fit-score-7cb7087b3aff75a31f487cfe5a9cedb0', 2, 2)": ['tcp://127.0.0.1:3729']} state: ['processing', 'processing'] workers: ['tcp://127.0.0.1:3729']
NoneType: None
distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://127.0.0.1:3730'], ('xgbregressor-fit-score-7cb7087b3aff75a31f487cfe5a9cedb0', 1, 2)
NoneType: None
I hope someone can help with this. Thanks in advance.
I ran into the same problem when I tried running locally on an EC2 instance. To fix it, I used:
from distributed import Client
from dask import config
config.set({'interface': 'lo'})  # found out to use 'lo' by running ifconfig in a shell
client = Client()
This issue helped me find the solution: https://github.com/dask/distributed/issues/1281
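To double-check that the fix took effect, one thing you can try (a sketch of my own, not from the linked issue) is asking the scheduler which addresses the workers actually registered with; workers advertising 127.0.0.1, as in the logs above, cannot be gathered from by other machines:

from distributed import Client

client = Client()  # same local client as above

# scheduler_info() describes the scheduler and its workers; each key in
# info['workers'] is the address the worker advertised, e.g. 'tcp://10.0.0.5:39040'
info = client.scheduler_info()
for addr in info['workers']:
    print(addr)  # 'tcp://127.0.0.1:...' here means remote machines cannot reach it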
I ran into the same problem too, and I think it is most likely a firewall issue.
Suppose we have two machines: 191.168.1.1 for the scheduler and 191.168.1.2 for the worker.
When we start the scheduler, we may see output like this:
distributed.scheduler - INFO - -----------------------------------------------
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://191.168.1.1:8786
distributed.scheduler - INFO - dashboard at: :8787
So for the scheduler, we should confirm that port 8786 (the scheduler) and port 8787 (the dashboard) are accessible.
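Before touching the firewall, a quick way to confirm reachability from the worker machine is a plain TCP connect. This is a generic sketch of my own (the address is the example scheduler above), not a Dask API:

import socket

def port_open(host, port, timeout=3):
    # Return True if a TCP connection to host:port succeeds within `timeout` seconds.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run this from the worker machine (191.168.1.2): both scheduler ports must be open.
for port in (8786, 8787):
    print(port, 'open' if port_open('191.168.1.1', port) else 'blocked')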
Similarly, we can look at the worker's output:
distributed.nanny - INFO - Start Nanny at: 'tcp://191.168.1.2:39042'
distributed.diskutils - INFO - Found stale lock file and directory '/root/dask-worker-space/worker-39rf_n28', purging
distributed.worker - INFO - Start worker at: tcp://191.168.1.2:39040
distributed.worker - INFO - Listening to: tcp://191.168.1.2:39040
distributed.worker - INFO - dashboard at: 191.168.1.2:39041
distributed.worker - INFO - Waiting to connect to: tcp://191.168.1.1:8786
distributed.worker - INFO - -------------------------------------------------
The nanny port is 39042, the worker port is 39040, and the dashboard port is 39041.
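The same check matters in the other direction: "Couldn't gather keys" means the client/scheduler could not pull results from the workers, so the worker's ports must be reachable from the scheduler machine as well. Reusing the port_open() helper sketched above, run this from 191.168.1.1:

# Run from the scheduler machine (191.168.1.1), using port_open() from the sketch above.
for port in (39040, 39041, 39042):
    print(port, 'open' if port_open('191.168.1.2', port) else 'blocked')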
Open these ports on both 191.168.1.1 and 191.168.1.2:
firewall-cmd --permanent --add-port=8786/tcp
firewall-cmd --permanent --add-port=8787/tcp
firewall-cmd --permanent --add-port=39040/tcp
firewall-cmd --permanent --add-port=39041/tcp
firewall-cmd --permanent --add-port=39042/tcp
firewall-cmd --reload
After that, the tasks run successfully.
Finally, by default Dask picks the worker's ports at random; we can also start the worker with custom ports:
dask-worker 191.168.1.1:8786 --worker-port 39040 --dashboard-address 39041 --nanny-port 39042
More parameters can be found here.