msgpack could not serialize large numpy ndarrays
I am trying to send large numpy ndarrays via client.scatter(np_ndarray). The np_ndarray is about 10 GB, and I get this error: msgpack Could not serialize object of type ndarray.
I enabled pickle when creating the client, like this: Client(self.adr, serializers=['dask', 'pickle']).
Is there a size limit beyond which msgpack cannot manage the data?
Is msgpack always used when data is sent by scatter, or does dask decide on the protocol depending on the data type?
I noticed that there is a project, msgpack-numpy. Are you planning to add support for it in dask, in case I file an eventual issue in dask?
When I initialize my client this way, what are the main advantages and disadvantages?
Thanks!
Rather than sending the large data to the workers, it may be more efficient to store the data (locally or remotely, as appropriate) and have the workers load it themselves. Something like this:
from joblib import dump, load

# persist the large array to disk once, on the client side
path_to_pickle = 'large_numpy.pickle'
dump(large_numpy, path_to_pickle)

def myfunc(path_to_pickle):
    # each worker loads the array from storage instead of receiving it over the wire
    large_numpy = load(path_to_pickle)
    # do something

fut = client.submit(myfunc, path_to_pickle)
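Note that this pattern assumes every worker can read path_to_pickle, e.g. via a shared filesystem, or by swapping the local path for a remote store (S3, GCS, etc.) that all workers can reach.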
If you do want to use msgpack, the maximum limit is about 4.3 GB; see the msgpack docs:
- a value of an Integer object is limited from -(2^63) up to (2^64)-1
- maximum length of a Binary object is (2^32)-1
- maximum byte size of a String object is (2^32)-1
There is some discussion of strategies here: specifically, if the object can be encoded as a string, the string can be split into multiple parts and each part sent individually. The receiving side then has to concatenate these parts and decode. Another option is streaming.
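As a minimal sketch of that split-and-reassemble idea (the 1 GB chunk size is an arbitrary choice for illustration, not anything prescribed by msgpack):

import msgpack

CHUNK = 2**30  # 1 GB per piece, safely under msgpack's (2^32)-1 Binary limit

def pack_in_chunks(data: bytes) -> bytes:
    # split the raw bytes into msgpack-safe pieces and pack them as a list
    pieces = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    return msgpack.packb(pieces)

def unpack_chunks(packed: bytes) -> bytes:
    # receiving side: unpack the list and concatenate the pieces back together
    return b"".join(msgpack.unpackb(packed))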
To answer your other three questions:
Is msgpack always used when data is sent by scatter, or does dask decide on the protocol depending on the data type?
Dask selects the default serializer based on your data; for reference, see: Dask Docs - Serialization
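A quick way to inspect which serializer Dask picks for a given object is distributed.protocol.serialize, which returns a header naming the chosen serializer. This is a sketch; the exact header contents may vary between distributed versions:

import numpy as np
from distributed.protocol import serialize

header, frames = serialize(np.arange(10))
print(header.get('serializer'))  # e.g. 'dask' for numpy arrays, 'pickle' as a fallback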
I noticed that there is a project, msgpack-numpy. Are you planning to add support for it in dask, in case I file an eventual issue in dask?
I checked with a Dask contributor, and it looks like there are no plans to support it now or in the near future. That said, feel free to open a discussion to gather more ideas. :)
When I initialize my client this way, what are the main advantages and disadvantages?
Serialization in Dask is tricky, so it is hard to pin down clear (dis)advantages. In general, though, manually specifying serializers is not recommended.