Dask DataFrame sum of a column always returns a scalar

Question

I created a Dask DataFrame (called "df"), and the column labeled 11 holds integer values:

In [62]: df[11]
Out[62]:
Dask Series Structure:
npartitions=42
    int64
      ...
    ...
      ...
      ...
Name: 11, dtype: int64
Dask Name: getitem, 168 tasks

I tried to sum these values with:

df[11].sum()

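What I get back is dd.Scalar<series-..., dtype=int64>. I have researched what this might mean, but I still don't understand why I am not getting a numeric value back. How can I convert this into a number?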

Answer 1

I think you need compute, which tells Dask to actually execute all of the operations queued up before it:

compute(**kwargs)
Compute this dask collection

This turns a lazy Dask collection into its in-memory equivalent. For example a Dask.array turns into a numpy.array() and a Dask.dataframe turns into a Pandas dataframe. The entire dataset must fit into memory before calling this operation.

df[11].sum().compute()
