将 dask dataframe 转换为 dataframe 太慢,使用并行处理时不会节省时间
convert dask dataframe to dataframe is too slow, it does not save time when using it parallel process
将 pandas 导入为 pd
将 dask.dataframe 导入为 dd
导入时间</p>
<pre><code>import warnings
warnings.simplefilter('ignore')
data['x'] = range(1000)
data['y'] = range(1000)
def add(s):
s['sum'] = s['x']+s['y']
return s
start = time.time()
n_data = data.apply(add, axis=1)
print('it cost time is {} sec'.format(time.time()-start))
start = time.time()
d_data = dd.from_pandas(data, npartitions=10)
s_data = d_data.apply(add, axis=1)
print('it cost time is {} sec'.format(time.time()-start))
start = time.time()
s_data = s_data.compute()
print('but transform it cost time is {} sec'.format(time.time()-start))
结果是:
it cost time is 1.0297248363494873 sec
it cost time is 0.008629083633422852 sec
but transform it cost time is 1.3664238452911377 sec
Pandas 应用很慢。因为您使用 Python 函数逐行操作,所以它必须使用 Python for 循环而不是 C for 循环。
Dask dataframe 的默认调度程序使用线程,这通常非常适合快速矢量化 Pandas 操作,但对受 Python 代码约束的慢速 Pandas 操作无济于事.您可以考虑尝试多处理或分布式调度程序。参见 http://docs.dask.org/en/latest/scheduling.html
但是,我鼓励您在尝试 Dask 之前更好地使用 Pandas。可能使用快速 Pandas API 比 Dask 更能加速你的计算。
将 pandas 导入为 pd
将 dask.dataframe 导入为 dd
导入时间</p>
<pre><code>import warnings
warnings.simplefilter('ignore')
data['x'] = range(1000)
data['y'] = range(1000)
def add(s):
s['sum'] = s['x']+s['y']
return s
start = time.time()
n_data = data.apply(add, axis=1)
print('it cost time is {} sec'.format(time.time()-start))
start = time.time()
d_data = dd.from_pandas(data, npartitions=10)
s_data = d_data.apply(add, axis=1)
print('it cost time is {} sec'.format(time.time()-start))
start = time.time()
s_data = s_data.compute()
print('but transform it cost time is {} sec'.format(time.time()-start))
结果是:
it cost time is 1.0297248363494873 sec
it cost time is 0.008629083633422852 sec
but transform it cost time is 1.3664238452911377 sec
Pandas 应用很慢。因为您使用 Python 函数逐行操作,所以它必须使用 Python for 循环而不是 C for 循环。
Dask dataframe 的默认调度程序使用线程,这通常非常适合快速矢量化 Pandas 操作,但对受 Python 代码约束的慢速 Pandas 操作无济于事.您可以考虑尝试多处理或分布式调度程序。参见 http://docs.dask.org/en/latest/scheduling.html
但是,我鼓励您在尝试 Dask 之前更好地使用 Pandas。可能使用快速 Pandas API 比 Dask 更能加速你的计算。