Converting a dask dataframe back to a pandas dataframe is too slow, so using it for parallel processing does not save any time

```python
import pandas as pd
import dask.dataframe as dd
import time
import warnings

warnings.simplefilter('ignore')

data = pd.DataFrame()
data['x'] = range(1000)
data['y'] = range(1000)

def add(s):
    s['sum'] = s['x'] + s['y']
    return s

# plain pandas apply
start = time.time()
n_data = data.apply(add, axis=1)
print('it cost time is {} sec'.format(time.time() - start))

# build the dask dataframe and schedule the apply (lazy, nothing runs yet)
start = time.time()
d_data = dd.from_pandas(data, npartitions=10)
s_data = d_data.apply(add, axis=1)
print('it cost time is {} sec'.format(time.time() - start))

# actually execute the graph and convert back to pandas
start = time.time()
s_data = s_data.compute()
print('but transform it cost time is {} sec'.format(time.time() - start))
```

The result is:

it cost time is 1.0297248363494873 sec

it cost time is 0.008629083633422852 sec

but transform it cost time is 1.3664238452911377 sec

Pandas apply is slow. Because you are applying a Python function row by row, it has to run a Python for loop rather than a C for loop.

Dask dataframe's default scheduler uses threads, which is usually a good fit for fast vectorized Pandas operations, but won't help much with slow Pandas operations that are bound by Python code. You might consider trying the multiprocessing or distributed schedulers instead; see the sketch below and http://docs.dask.org/en/latest/scheduling.html
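As a rough sketch of what switching schedulers looks like, reusing the names from the question and assuming a dask version where `compute` accepts the `scheduler=` keyword (0.19 or later):

```python
import pandas as pd
import dask.dataframe as dd

data = pd.DataFrame({'x': range(1000), 'y': range(1000)})

def add(s):
    s['sum'] = s['x'] + s['y']
    return s

d_data = dd.from_pandas(data, npartitions=10)
graph = d_data.apply(add, axis=1, meta={'x': 'int64', 'y': 'int64', 'sum': 'int64'})

# Default: threaded scheduler. Threads share the GIL, so a Python-level
# apply sees little or no speedup.
threaded = graph.compute(scheduler='threads')

# Process-based scheduler: each partition runs in its own process, which
# avoids the GIL but pays to serialize data between processes.
multiproc = graph.compute(scheduler='processes')

# The distributed scheduler also works on a single machine:
# from dask.distributed import Client
# client = Client()   # once created, .compute() uses it by default
```

Whether the process-based scheduler actually wins here depends on how expensive the per-row work is relative to the cost of moving each partition between processes.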

However, I encourage you to look into using Pandas better before reaching for Dask. You can probably speed up this computation more with the fast, vectorized Pandas API than with Dask, as in the example below.
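For this particular computation, for example, the whole `apply` can be replaced by a single vectorized column addition. A minimal sketch, reusing the question's column names and timing style:

```python
import pandas as pd
import time

data = pd.DataFrame({'x': range(1000), 'y': range(1000)})

start = time.time()
data['sum'] = data['x'] + data['y']   # vectorized: runs in C, no per-row Python call
print('vectorized version cost time is {} sec'.format(time.time() - start))
```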