npartitions 在 Dask 数据帧中的作用是什么？

Question

我在很多函数中看到参数npartitions，但我不明白它有什么用。

http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_csv

head(...)

Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions a warning will be raised and any found rows returned. Pass -1 to use all partitions.

repartition(...)

Number of partitions of output, must be less than npartitions of input. Only used if divisions isn’t specified.

这里的分区数大概是5个：

（图片来源：http://dask.pydata.org/en/latest/dataframe-overview.html）

Answer 1

npartitions 属性是组成单个 Dask 数据帧的 Pandas 数据帧的数量。这会以两种主要方式影响性能。

如果您没有足够的分区，那么您可能无法有效地使用所有内核。例如，如果您的 dask.dataframe 只有一个分区，那么一次只能运行一个核心。
如果您有太多分区，那么调度程序可能会产生大量开销来决定在何处计算每个任务。

通常，您需要的分区数量是核心数的几倍。每个任务在调度程序中占用几百微秒。

您可以在数据摄取时使用 read_csv(...) 中的 blocksize= 等参数或之后使用 .repartition(...) 方法确定分区数。

Answer 2

我试图检查适合我的情况的最佳数字是多少。我在 8 核笔记本电脑上工作。我有 100Gb csv 文件，其中包含 250M 行和 25 列。我运行在 1,5,30,1000 个分区上“描述”函数

df = df.repartition(npartitions=1)
a1=df['age'].describe().compute()
df = df.repartition(npartitions=5)
a2=df['age'].describe().compute()
df = df.repartition(npartitions=30)
a3=df['age'].describe().compute()
df = df.repartition(npartitions=100)
a4=df['age'].describe().compute()

关于速度：

5,30 > 大约 3 分钟

1, 1000 > 大约 9 分钟

但是...我发现当我使用多个分区时，中位数或百分位数等“排序”函数给出了错误的数字。

1 个分区给出正确的数字（我使用 pandas 和 dask 用小数据检查了它）

npartitions 在 Dask 数据帧中的作用是什么？

What is the role of npartitions in a Dask dataframe?

python

dataframe

dask