msgpack could not serialize large numpy ndarrays
I am trying to send large numpy ndarrays via client.scatter(np_ndarray). The np_ndarray is about 10 GB, and I get this error: msgpack Could not serialize object of type ndarray.
I enabled pickle when creating the client, like this: Client(self.adr, serializers=['dask', 'pickle']).
Is there a size limit beyond which msgpack cannot manage the data?
Is msgpack always used when data is sent by scatter, or does dask decide on the protocol depending on the data type?
I noticed that there is a project, msgpack-numpy. Are you planning to add support for it in dask, in case I file an eventual issue in dask?
When I initialize my client this way, what are the main advantages and disadvantages?
Thanks!
Rather than sending the large data to the workers, it may be more efficient to store the data (locally or remotely, as appropriate) and have the workers load it themselves. Something like this:
from joblib import dump, load

# persist the large array to disk once, on the client side
path_to_pickle = 'large_numpy.pickle'
dump(large_numpy, path_to_pickle)

def myfunc(path_to_pickle):
    # each worker loads the array from storage instead of receiving it over the wire
    large_numpy = load(path_to_pickle)
    # do something

fut = client.submit(myfunc, path_to_pickle)
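Note that this pattern assumes every worker can read path_to_pickle, e.g. via a shared filesystem, or by swapping the local path for a remote store (S3, GCS, etc.) that all workers can reach.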
If you do want to use msgpack, the maximum limit is about 4.3 GB; see the msgpack docs:
- a value of an Integer object is limited from -(2^63) up to (2^64)-1
- maximum length of a Binary object is (2^32)-1
- maximum byte size of a String object is (2^32)-1
There is some discussion of strategies here: specifically, if the object can be encoded as a string, the string can be split into multiple parts and each part sent individually. The receiving side then has to concatenate these parts and decode. Another option is streaming.
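As a minimal sketch of that split-and-reassemble idea (the 1 GB chunk size is an arbitrary choice for illustration, not anything prescribed by msgpack):

import msgpack

CHUNK = 2**30  # 1 GB per piece, safely under msgpack's (2^32)-1 Binary limit

def pack_in_chunks(data: bytes) -> bytes:
    # split the raw bytes into msgpack-safe pieces and pack them as a list
    pieces = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    return msgpack.packb(pieces)

def unpack_chunks(packed: bytes) -> bytes:
    # receiving side: unpack the list and concatenate the pieces back together
    return b"".join(msgpack.unpackb(packed))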
To answer your other three questions:
Is msgpack always used when data is sent by scatter, or does dask decide on the protocol depending on the data type?
Dask selects the default serializer based on your data; for reference, see: Dask Docs - Serialization
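A quick way to inspect which serializer Dask picks for a given object is distributed.protocol.serialize, which returns a header naming the chosen serializer. This is a sketch; the exact header contents may vary between distributed versions:

import numpy as np
from distributed.protocol import serialize

header, frames = serialize(np.arange(10))
print(header.get('serializer'))  # e.g. 'dask' for numpy arrays, 'pickle' as a fallback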
I noticed that there is a project, msgpack-numpy. Are you planning to add support for it in dask, in case I file an eventual issue in dask?
I checked with a Dask contributor, and it looks like there are no plans to support it now or in the near future. That said, feel free to open a discussion to gather more ideas. :)
When I initialize my client this way, what are the main advantages and disadvantages?
Serialization in Dask is tricky, so it is hard to pin down clear (dis)advantages. In general, though, manually specifying serializers is not recommended.