Does npartitions influence the result of dask.dataframe.head()?

Question

When I run the following code, the result of dask.dataframe.head() depends on npartitions:

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4]})
ddf = dd.from_pandas(df, npartitions=3)
print(ddf.head())

This produces the following output:

   A  B
0  1  2

However, when I set npartitions to 1 or 2, I get the expected result.

It seems to matter that npartitions is smaller than the length of the dataframe. Is this intended?

Answer 1

According to the documentation, dd.head() only checks the first partition:

head(n=5, compute=True)

First n rows of the dataset

Caveat, this only checks the first n rows of the first partition.

So the answer is yes: dd.head() is affected by how many partitions the dask dataframe has.

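However, the first partition is expected to contain more rows than you would typically display with dd.head(); otherwise using dask would not pay off. The only common case where this may not hold is when taking the first n rows/elements after filtering.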


python

pandas

dask