如何有效地在 dask 中使用 pandas.cut() （或等效的）？

Question

Dask 中是否有等同于 pandas.cut() 的函数？

我尝试在 Python 中对大型数据集进行分箱和分组。它是具有属性（positionX，positionY，能量，时间）的测量电子列表。我需要将它沿着 positionX、positionY 分组并在能量类.

中进行装箱

到目前为止，我可以用 pandas 完成，但我想运行并行完成。所以，我尝试使用dask。

groupby 方法工作得很好，但不幸的是，我运行在尝试 bin 能量中的数据时遇到了困难。我找到了一个使用 pandas.cut() 的解决方案，但它需要在原始数据集上调用 compute()（将其本质上转换为非并行代码）。在 dask 中是否有等同于 pandas.cut() 的方法，或者是否有另一种（优雅的）方法来实现相同的功能？

import dask 
# create dask dataframe from the array
dd = dask.dataframe.from_array(mainArray, chunksize=100000, columns=('posX','posY', 'time', 'energy'))

# Set the bins to bin along energy
bins = range(0, 10000, 500)

# Create the cut in energy (using non-parallel pandas code...)
energyBinner=pandas.cut(dd['energy'],bins)

# Group the data according to posX, posY and energy
grouped = dd.compute().groupby([energyBinner, 'posX', 'posY'])

# Apply the count() method to the data:
numberOfEvents = grouped['time'].count()

非常感谢！

Answer 1

您应该可以做到 dd['energy'].map_partitions(pd.cut, bins)。

如何有效地在 dask 中使用 pandas.cut() （或等效的）？

How to use pandas.cut() (or equivalent) in dask efficiently?

python

pandas

dask

Dask 中是否有等同于 pandas.cut() 的函数？