What is the difference between dask.array.from_array(np.random.random) and dask.array.random.random?

I ran into a case where I need to train on a pile of data (about 22 GiB). I tested two ways of generating random data and tried to train on it with Dask; however, the NumPy-generated data raises an exception (msgpack: bytes object is too large) while the dask.array version works fine. Does anyone know why?

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from dask import array as da
import numpy as np
import xgboost as xgb
import time


def main(client):
    n = 3000
    m = 1000000
    # data generated by NumPy raises an exception (msgpack: bytes object is too large)
    X = np.random.random((m, n))
    y = np.random.random((m, 1))
    X = da.from_array(X, chunks=(1000, n))
    y = da.from_array(y, chunks=(1000, 1))

    # data generated by dask.array works well
    # X = da.random.random(size=(m, n), chunks=(1000, n))
    # y = da.random.random(size=(m, 1), chunks=(1000, 1))

    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    del X
    del y

    params = {'tree_method': 'gpu_hist'}
    watchlist = [(dtrain, 'train')]
    start = time.time()
    # xgb.dask.train returns a dict: {'booster': Booster, 'history': ...}
    output = xgb.dask.train(client, params, dtrain,
                            num_boost_round=100, evals=watchlist)
    print('elapsed:', time.time() - start)


if __name__ == '__main__':
    with LocalCUDACluster(n_workers=4, device_memory_limit='12 GiB') as cluster:
        with Client(cluster) as client:
            main(client)

After some testing I found the reason: da.random.random is also a lazy function (so only the definition of the random array is passed to the workers), whereas da.from_array embeds the already materialized array in the task graph. In our case msgpack limits the size of the data passed to each worker to 4 GiB, so in general it does not work to pass more than 4 GiB of data directly to Dask XGBoost. (Incidentally, we can bypass msgpack's limit by switching to Parquet data and reading it as chunked dask.dataframe data; both points are sketched below.)
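To make the laziness difference concrete, here is a minimal sketch (my own illustration, not from the original post) comparing how much data each construction carries in its task graph; the printed sizes are approximate:

import pickle

import numpy as np
import dask.array as da

arr = np.random.random((10_000, 100))                        # ~8 MB of float64
x_np = da.from_array(arr, chunks=(1_000, 100))               # embeds the data
x_da = da.random.random((10_000, 100), chunks=(1_000, 100))  # embeds a recipe

# from_array keeps the concrete ndarray inside the task graph, so
# serializing the graph serializes all of the data; da.random.random's
# graph only records how each chunk should be generated on the worker.
print(len(pickle.dumps(dict(x_np.__dask_graph__()))))  # megabytes
print(len(pickle.dumps(dict(x_da.__dask_graph__()))))  # kilobytes

And a minimal sketch of the Parquet workaround; the file name train.parquet and the 'label' column are hypothetical, and it assumes the dataset was written out in a separate step:

from dask.distributed import Client
import dask.dataframe as dd
import xgboost as xgb

client = Client()  # or the LocalCUDACluster from above

df = dd.read_parquet('train.parquet')  # hypothetical path
X = df.drop(columns=['label'])         # hypothetical label column
y = df['label']

# Each partition is loaded directly on a worker, so no single message
# ever approaches msgpack's 4 GiB ceiling.
dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(client, {'tree_method': 'gpu_hist'}, dtrain,
                        num_boost_round=100)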

The following command proves my guess.
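A minimal check along these lines reproduces the error (a sketch using the msgpack package directly, not necessarily the original command; note that it allocates a 4 GiB bytes object):

import msgpack

# msgpack's bin format stores the payload length in a 32-bit field, so a
# single bytes object may hold at most 2**32 - 1 bytes (just under 4 GiB).
blob = b'\x00' * (2 ** 32)  # one byte past the limit; needs ~4 GiB of RAM
msgpack.packb(blob)         # raises ValueError: bytes object is too large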