msgpack could not serialize large numpy ndarrays

I'm trying to send large numpy ndarrays via client.scatter(np_ndarray). The np_ndarray is about 10 GB, and I get this error from msgpack: Could not serialize object of type ndarray.

I did use pickle when creating the client, like this: Client(self.adr, serializers=['dask', 'pickle'])

Thanks!

Rather than sending the large data to the workers, it may be more efficient to store the data (locally or remotely, as the situation requires) and have the workers load it themselves. Like this:

from joblib import dump, load

# persist the large array once, to a location the workers can also reach
path_to_pickle = 'large_numpy.pickle'
dump(large_numpy, path_to_pickle)

def myfunc(path_to_pickle):
    # each worker loads the data itself instead of receiving it over the network
    large_numpy = load(path_to_pickle)
    # do something

fut = client.submit(myfunc, path_to_pickle)

If you want to use msgpack, the maximum limit is about 4.3 GB; see the docs (a quick size check follows the list below):

  • a value of an Integer object is limited from -(2^63) up to (2^64)-1
  • maximum length of a Binary object is (2^32)-1
  • maximum byte size of a String object is (2^32)-1
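
To see why a ~10 GB array trips this, here is a quick back-of-the-envelope check in Python (the 10 GB figure is taken from your question; no large allocation is needed):

MSGPACK_BIN_MAX = 2**32 - 1      # maximum length of a msgpack Binary object, in bytes

nbytes = 10 * 10**9              # the ~10 GB ndarray from the question
print(MSGPACK_BIN_MAX / 10**9)   # ~4.29 -> the "about 4.3 GB" limit
print(nbytes > MSGPACK_BIN_MAX)  # True -> the raw bytes do not fit in one Binary object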

There is some discussion of strategies here; specifically, if it's possible to encode the object as a string, the string can be split into multiple parts and each part sent individually. The receiving side would then have to concatenate the parts and decode them. Another option is streaming. A rough sketch of the split-and-reassemble idea follows.
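
For illustration only, here is a minimal sketch of that idea with the msgpack-python package: the array's raw bytes are split into parts below the Binary limit, each part is packed separately, and the receiving side reassembles them. The helper names (pack_in_parts, unpack_parts) and the 1 GiB chunk size are my own choices, not something Dask does internally:

import msgpack
import numpy as np

CHUNK = 2**30  # 1 GiB per part, comfortably below msgpack's (2^32)-1 Binary limit

def pack_in_parts(arr):
    # the first part carries the metadata needed to rebuild the array
    header = msgpack.packb({"dtype": str(arr.dtype), "shape": list(arr.shape)})
    raw = arr.tobytes()
    parts = [msgpack.packb(raw[i:i + CHUNK]) for i in range(0, len(raw), CHUNK)]
    return [header] + parts

def unpack_parts(parts):
    # receiving side: decode each part, concatenate, and rebuild the array
    # (max_bin_len is raised explicitly; default limits vary across msgpack versions)
    meta = msgpack.unpackb(parts[0])
    raw = b"".join(msgpack.unpackb(p, max_bin_len=CHUNK) for p in parts[1:])
    return np.frombuffer(raw, dtype=meta["dtype"]).reshape(meta["shape"])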

To answer your other three questions:

Is msgpack always used when data is sent by scatter, or dask decides about the protocol depending on the data type?

Yes, Dask selects the default serializers depending on your data, see: Dask Docs - Serialization. A small way to check this is sketched below.
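
If you want to see which serializer Dask picks for a given object, here is a small sketch using distributed.protocol.serialize; note this is an internal API, so the exact header contents may differ between versions:

import numpy as np
from distributed.protocol import serialize

header, frames = serialize(np.arange(10))
print(header.get("serializer"))  # typically 'dask': numpy arrays have a dedicated handler

header, frames = serialize(object())
print(header.get("serializer"))  # typically 'pickle': the fallback for plain objects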

I noticed that there is a project for Msgpack-Numpy. Are you planning to add support for it in dask, if I were to open an issue in dask describing this?

I checked with a Dask contributor, and it looks like there are no plans to support it now or in the near future. That said, feel free to start a discussion to gather more ideas. :)

When I initialize my client this way, what are the main advantages and disadvantages?

Serialization in Dask is tricky, so it's hard to pin down clear (dis)advantages. In general, though, manually specifying serializers is not recommended.
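
For reference, here is a minimal sketch contrasting the default with the manual override from your question (the scheduler address is a placeholder):

from distributed import Client

# default: Dask picks a serializer per object (recommended)
client = Client("tcp://scheduler:8786")

# manual override, as in the question: msgpack is excluded, so large ndarrays
# fall back to dask/pickle and avoid the ~4.3 GB msgpack limit, at the cost of
# whatever msgpack would otherwise have handled more efficiently
client = Client("tcp://scheduler:8786",
                serializers=["dask", "pickle"],
                deserializers=["dask", "pickle"])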