Dask dataframe compute failed

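I'm experimenting with Python Dask. I followed their dataframe example Jupyter notebook, but it fails at the step that converts the Dask dataframe to a pandas object by calling compute(). Can anyone tell me what I'm doing wrong?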

Code:

### Cell0
!pip install "dask[complete]"
!pip install pandas

### Cell1 
import dask
import dask.dataframe as dd
df = dask.datasets.timeseries()
df

### Cell2 
df2 = df[df.y > 0]
df3 = df2.groupby('name').x.std()
df3

### Cell3
computed_df = df3.compute()
type(computed_df)

The error is raised in Cell3 when executing computed_df = df3.compute():

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-6da1eef50c1d> in <module>
----> 1 computed_df = df3.compute()
      2 type(computed_df)

~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/base.py in compute(self, **kwargs)
    283         dask.base.compute
    284         """
--> 285         (result,) = compute(self, traverse=False, **kwargs)
    286         return result
    287 

~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/base.py in compute(*args, **kwargs)
    559     )
    560 
--> 561     dsk = collections_to_dsk(collections, optimize_graph, **kwargs)
    562     keys, postcomputes = [], []
    563     for x in collections:

~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/base.py in collections_to_dsk(collections, optimize_graph, optimizations, **kwargs)
    335         for opt, val in groups.items():
    336             dsk, keys = _extract_graph_and_keys(val)
--> 337             dsk = opt(dsk, keys, **kwargs)
    338 
    339             for opt in optimizations:

~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/dataframe/optimize.py in optimize(dsk, keys, **kwargs)
     20     else:
     21         # Perform Blockwise optimizations for HLG input
---> 22         dsk = optimize_dataframe_getitem(dsk, keys=keys)
     23         dsk = optimize_blockwise(dsk, keys=keys)
     24         dsk = fuse_roots(dsk, keys=keys)

~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/dataframe/optimize.py in optimize_dataframe_getitem(dsk, keys)
    103         # Project columns and update blocks
    104         old = layers[k]
--> 105         new = old.project_columns(columns)[0]
    106         if new.name != old.name:
    107             columns = list(columns)

~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/layers.py in project_columns(self, columns)
    941             # Apply column projection in IO function
    942             try:
--> 943                 io_func = self.io_func.project_columns(list(columns))
    944             except AttributeError:
    945                 io_func = self.io_func

~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/dataframe/io/demo.py in project_columns(self, columns)
     87         func = copy.deepcopy(self)
     88         func.columns = columns
---> 89         func.dtypes = {c: self.dtypes[c] for c in columns}
     90         return func
     91 

~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/dataframe/io/demo.py in <dictcomp>(.0)
     87         func = copy.deepcopy(self)
     88         func.columns = columns
---> 89         func.dtypes = {c: self.dtypes[c] for c in columns}
     90         return func
     91 

KeyError: 'gt-d5f81fc97f91e68c389fc34631419acc'

Interesting. I can reproduce this error with the following environment:

python=3.9.4
pandas=1.2.4
dask=2021.5.0
distributed=2021.5.0

The error comes specifically from this step:

df2 = df[df.y > 0]
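To make the last frame of the traceback easier to read, here is a minimal pure-Python sketch of the failing dict comprehension. It does not use dask at all. The `dtypes` mapping and the `project_columns` helper are illustrative stand-ins (assumed from the default columns of dask.datasets.timeseries and from the traceback), not dask's actual code. The `gt-...` key looks like the name of the graph layer created by `df.y > 0` rather than a column name, though that reading is an inference from the traceback.

```python
# Stand-in for the dtypes mapping held by the demo IO function.
# Assumption: these are the default timeseries columns (id, name, x, y).
dtypes = {"id": int, "name": str, "x": float, "y": float}


def project_columns(dtypes, columns):
    """Mimic the dict comprehension shown at demo.py line 89 in the traceback."""
    return {c: dtypes[c] for c in columns}


# Projecting onto real column names works.
projected = project_columns(dtypes, ["name", "x"])

# A non-column key mixed in with the columns fails the same way as the traceback.
try:
    project_columns(dtypes, ["y", "gt-d5f81fc97f91e68c389fc34631419acc"])
    error_key = None
except KeyError as exc:
    error_key = exc.args[0]

print(projected)
print("KeyError:", repr(error_key))
```

In other words, the filter itself is valid user code; the failure comes from the graph optimization step handing a non-column key to the column projection.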

I raised an issue on GitHub, but in the meantime downgrading dask to 2021.4.1 resolves the problem (the computed result will show):

python=3.9.4
pandas=1.2.4
dask=2021.4.1
distributed=2021.4.1
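If you want a notebook to tell you whether it is running the release that failed here, a small check along these lines may help. It is only a sketch: `KNOWN_BAD`, `is_known_bad`, and `installed_dask_version` are hypothetical names, and the set contains only the version reported in this thread, so it says nothing about other releases.

```python
from importlib import metadata

# Only the version reported as failing in this thread; not an exhaustive list.
KNOWN_BAD = {"2021.5.0"}


def is_known_bad(version):
    """Return True if the version string matches a release reported as failing here."""
    return version in KNOWN_BAD


def installed_dask_version():
    """Return the installed dask version string, or None if dask is not installed."""
    try:
        return metadata.version("dask")
    except metadata.PackageNotFoundError:
        return None


version = installed_dask_version()
if version is None:
    print("dask is not installed")
elif is_known_bad(version):
    print(f"dask {version} matches the failing release; 2021.4.1 worked in this thread")
else:
    print(f"dask {version} is not the release reported as failing here")
```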

(Note that Python here is 3.9, which appears to match your environment.)