dask-kubernetes KubeCluster 卡住了
dask-kubernetes KubeCluster stuck
我正在努力起床并运行 dask 在 kubernetes 上。下面实际上是 dask-kubernetes 的 hello world,但我被下面的错误困住了。
main.py:
import os
from dask_kubernetes import KubeCluster
from dask.distributed import Client
import dask.array as da
if __name__ == '__main__':
path_to_src = os.path.dirname(os.path.abspath(__file__))
cluster = KubeCluster(os.path.join(path_to_src, 'pod-spec.yaml'), namespace='124381-dev')
print('Cluster constructed')
cluster.scale(10)
# print('Cluster scaled')
# Connect Dask to the cluster
client = Client(cluster)
print('Client constructed')
# Create a large array and calculate the mean
array = da.ones((100, 100, 100))
print('Created big array')
print(array.mean().compute()) # Should print 1.0
print('Computed mean')
输出:
$ python src/main.py
Creating scheduler pod on cluster. This may take some time.
Forwarding from 127.0.0.1:60611 -> 8786
Handling connection for 60611
Handling connection for 60611
Handling connection for 60611
Cluster constructed
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f1f874b8130>>, <Task finished name='Task-54' coro=<SpecCluster._correct_state_internal() done, defined at /home/cliff/anaconda3/envs/dask/lib/python3.8/site-packages/distributed/deploy/spec.py:327> exception=TypeError("unsupported operand type(s) for +=: 'NoneType' and 'list'")>)
Traceback (most recent call last):
File "/home/cliff/anaconda3/envs/dask/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/home/cliff/anaconda3/envs/dask/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "/home/cliff/anaconda3/envs/dask/lib/python3.8/site-packages/distributed/deploy/spec.py", line 358, in _correct_state_internal
worker = cls(self.scheduler.address, **opts)
File "/home/cliff/anaconda3/envs/dask/lib/python3.8/site-packages/dask_kubernetes/core.py", line 151, in __init__
self.pod_template.spec.containers[0].args += worker_name_args
TypeError: unsupported operand type(s) for +=: 'NoneType' and 'list'
Handling connection for 60611
Handling connection for 60611
Client constructed
Created big array
请注意,输出末尾没有终端提示 - 它仍然是 运行,但从未进行过。在另一个终端中,kubectl get pods
也将“悬崖测试”显示为 运行。
pod-spec.yaml:
apiVersion: v1
kind: Pod
metadata:
name: cliff-testing
labels:
app: cliff-docker-test
spec:
imagePullSecrets:
- name: <redacted>
securityContext:
runAsUser: 1000
restartPolicy: OnFailure
containers:
- name: cliff-test-container
image: <redacted: works with docker pull>
imagePullPolicy: Always
resources:
limits:
cpu: 2
memory: 4G
requests:
cpu: 1
memory: 2G
在广告连播模板(pod-spec.yaml)中,设置了字段metadata.name
。删除它允许代码为 运行。 dask-kubernetes 似乎创建了一个名为“dask--”的调度程序 pod,并遵循与 worker 相同的命名方法。通过修复 pod 模板中的名称,dask-kubernetes 试图创建与调度程序 pod(以及彼此)同名的 worker pods,这是非法的。
如果要以不同的方式命名 pods,可以在构造 KubeCluster
时使用关键字参数 name
来命名 pods(dask 会自动附加一个每个 pod 的这个名称的随机字符串)。
例如,下面的示例将导致每个 pod(调度程序和工作程序)被命名为“my-dask-pods-”
from dask_kubernetes import KubeCluster
cluster = KubeCluster('pod-spec.yaml', name='my-dask-pods-')
我正在努力起床并运行 dask 在 kubernetes 上。下面实际上是 dask-kubernetes 的 hello world,但我被下面的错误困住了。
main.py:
import os
from dask_kubernetes import KubeCluster
from dask.distributed import Client
import dask.array as da
if __name__ == '__main__':
path_to_src = os.path.dirname(os.path.abspath(__file__))
cluster = KubeCluster(os.path.join(path_to_src, 'pod-spec.yaml'), namespace='124381-dev')
print('Cluster constructed')
cluster.scale(10)
# print('Cluster scaled')
# Connect Dask to the cluster
client = Client(cluster)
print('Client constructed')
# Create a large array and calculate the mean
array = da.ones((100, 100, 100))
print('Created big array')
print(array.mean().compute()) # Should print 1.0
print('Computed mean')
输出:
$ python src/main.py
Creating scheduler pod on cluster. This may take some time.
Forwarding from 127.0.0.1:60611 -> 8786
Handling connection for 60611
Handling connection for 60611
Handling connection for 60611
Cluster constructed
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f1f874b8130>>, <Task finished name='Task-54' coro=<SpecCluster._correct_state_internal() done, defined at /home/cliff/anaconda3/envs/dask/lib/python3.8/site-packages/distributed/deploy/spec.py:327> exception=TypeError("unsupported operand type(s) for +=: 'NoneType' and 'list'")>)
Traceback (most recent call last):
File "/home/cliff/anaconda3/envs/dask/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/home/cliff/anaconda3/envs/dask/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "/home/cliff/anaconda3/envs/dask/lib/python3.8/site-packages/distributed/deploy/spec.py", line 358, in _correct_state_internal
worker = cls(self.scheduler.address, **opts)
File "/home/cliff/anaconda3/envs/dask/lib/python3.8/site-packages/dask_kubernetes/core.py", line 151, in __init__
self.pod_template.spec.containers[0].args += worker_name_args
TypeError: unsupported operand type(s) for +=: 'NoneType' and 'list'
Handling connection for 60611
Handling connection for 60611
Client constructed
Created big array
请注意,输出末尾没有终端提示 - 它仍然是 运行,但从未进行过。在另一个终端中,kubectl get pods
也将“悬崖测试”显示为 运行。
pod-spec.yaml:
apiVersion: v1
kind: Pod
metadata:
name: cliff-testing
labels:
app: cliff-docker-test
spec:
imagePullSecrets:
- name: <redacted>
securityContext:
runAsUser: 1000
restartPolicy: OnFailure
containers:
- name: cliff-test-container
image: <redacted: works with docker pull>
imagePullPolicy: Always
resources:
limits:
cpu: 2
memory: 4G
requests:
cpu: 1
memory: 2G
在广告连播模板(pod-spec.yaml)中,设置了字段metadata.name
。删除它允许代码为 运行。 dask-kubernetes 似乎创建了一个名为“dask-
如果要以不同的方式命名 pods,可以在构造 KubeCluster
时使用关键字参数 name
来命名 pods(dask 会自动附加一个每个 pod 的这个名称的随机字符串)。
例如,下面的示例将导致每个 pod(调度程序和工作程序)被命名为“my-dask-pods-
from dask_kubernetes import KubeCluster
cluster = KubeCluster('pod-spec.yaml', name='my-dask-pods-')