What is the difference between dask.array.from_array(np.random.random) and dask.array.random.random?
I need to train on a fairly large dataset (about 22 GiB). I tested two ways of generating random data and training on it with Dask. However, the NumPy-generated data raises an exception (msgpack: bytes object is too large) while the dask.array-generated data works fine. Does anyone know why?
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from dask import array as da
import numpy as np
import xgboost as xgb
import time

def main(client):
    n = 3000
    m = 1000000
    # numpy-generated data will raise an exception
    X = np.random.random((m, n))
    y = np.random.random((m, 1))
    X = da.from_array(X, chunks=(1000, n))
    y = da.from_array(y, chunks=(1000, 1))
    # data generated by dask.array works well
    # X = da.random.random(size=(m, n), chunks=(1000, n))
    # y = da.random.random(size=(m, 1), chunks=(1000, 1))
    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    del X
    del y
    params = {'tree_method': 'gpu_hist'}
    watchlist = [(dtrain, 'train')]
    start = time.time()
    bst = xgb.dask.train(client, params, dtrain,
                         num_boost_round=100, evals=watchlist)
    print('consume:', time.time() - start)

if __name__ == '__main__':
    with LocalCUDACluster(n_workers=4, device_memory_limit='12 GiB') as cluster:
        with Client(cluster) as client:
            main(client)
After some testing I found the reason: da.random.random is itself a lazy function, so only the random-generation recipe is shipped to the workers. With from_array, by contrast, the concrete data must be sent, and msgpack limits the size of a single message to 4 GiB. So, in general, you cannot feed more than 4 GiB of in-memory data directly to Dask XGBoost this way. (As a workaround, we can write the data to Parquet and read it back as dask.dataframe partitions, which bypasses the msgpack limit.)
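Besides the Parquet route, another way to keep large data out of the scheduler's messages (a sketch of a standard Dask pattern, not from the original answer) is to build each chunk on the workers with dask.delayed, so only a small recipe travels over the wire:

```python
import numpy as np
import dask
import dask.array as da

def make_chunk(seed, rows, cols):
    # Runs on a worker: each chunk is created where it is needed,
    # so the full array never passes through msgpack in one piece.
    rng = np.random.default_rng(seed)
    return rng.random((rows, cols))

# 10 chunks of 1000x100 for illustration; scale rows/cols as needed.
chunks = [
    da.from_delayed(dask.delayed(make_chunk)(i, 1000, 100),
                    shape=(1000, 100), dtype=float)
    for i in range(10)
]
X = da.concatenate(chunks, axis=0)
print(X.shape)
```

In a real job you would point such delayed functions at your actual data loader (one call per partition) instead of a random generator.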
The snippet below confirmed my guess.
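The difference can be seen without a cluster: pickling the two arrays shows that from_array carries the full ndarray inside its task graph, while da.random.random carries only a tiny recipe. A minimal sketch (the exact sizes are illustrative and version-dependent):

```python
import pickle
import numpy as np
import dask.array as da

x_np = np.random.random((10000, 100))  # ~8 MB of concrete float64 data

# from_array embeds the concrete ndarray in the task graph, so
# serializing the array means serializing all of its data.
x_eager = da.from_array(x_np, chunks=(1000, 100))

# da.random.random records only how to generate each chunk, so the
# graph stays tiny regardless of the array's logical size.
x_lazy = da.random.random(size=(10000, 100), chunks=(1000, 100))

eager_size = len(pickle.dumps(x_eager))
lazy_size = len(pickle.dumps(x_lazy))
print(eager_size, lazy_size)  # eager_size dwarfs lazy_size
```

Scaled up to the 22 GiB array in the question, the from_array graph is what blows past msgpack's 4 GiB message limit.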