Dask dataframe compute failed
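I'm playing with Python Dask. I followed their dataframe example Jupyter notebook, but it fails at the step that converts the Dask dataframe to a pandas dataframe by calling compute().
Can anyone tell me what I'm doing wrong?
Code: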
### Cell0
!pip install "dask[complete]"
!pip install pandas
### Cell1
import dask
import dask.dataframe as dd
df = dask.datasets.timeseries()
df
### Cell2
df2 = df[df.y > 0]
df3 = df2.groupby('name').x.std()
df3
### Cell3
computed_df = df3.compute()
type(computed_df)
The error occurs in Cell3 when executing computed_df = df3.compute():
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-6-6da1eef50c1d> in <module>
----> 1 computed_df = df3.compute()
2 type(computed_df)
~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/base.py in compute(self, **kwargs)
283 dask.base.compute
284 """
--> 285 (result,) = compute(self, traverse=False, **kwargs)
286 return result
287
~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/base.py in compute(*args, **kwargs)
559 )
560
--> 561 dsk = collections_to_dsk(collections, optimize_graph, **kwargs)
562 keys, postcomputes = [], []
563 for x in collections:
~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/base.py in collections_to_dsk(collections, optimize_graph, optimizations, **kwargs)
335 for opt, val in groups.items():
336 dsk, keys = _extract_graph_and_keys(val)
--> 337 dsk = opt(dsk, keys, **kwargs)
338
339 for opt in optimizations:
~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/dataframe/optimize.py in optimize(dsk, keys, **kwargs)
20 else:
21 # Perform Blockwise optimizations for HLG input
---> 22 dsk = optimize_dataframe_getitem(dsk, keys=keys)
23 dsk = optimize_blockwise(dsk, keys=keys)
24 dsk = fuse_roots(dsk, keys=keys)
~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/dataframe/optimize.py in optimize_dataframe_getitem(dsk, keys)
103 # Project columns and update blocks
104 old = layers[k]
--> 105 new = old.project_columns(columns)[0]
106 if new.name != old.name:
107 columns = list(columns)
~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/layers.py in project_columns(self, columns)
941 # Apply column projection in IO function
942 try:
--> 943 io_func = self.io_func.project_columns(list(columns))
944 except AttributeError:
945 io_func = self.io_func
~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/dataframe/io/demo.py in project_columns(self, columns)
87 func = copy.deepcopy(self)
88 func.columns = columns
---> 89 func.dtypes = {c: self.dtypes[c] for c in columns}
90 return func
91
~/.pyenv/versions/3.9.0/lib/python3.9/site-packages/dask/dataframe/io/demo.py in <dictcomp>(.0)
87 func = copy.deepcopy(self)
88 func.columns = columns
---> 89 func.dtypes = {c: self.dtypes[c] for c in columns}
90 return func
91
KeyError: 'gt-d5f81fc97f91e68c389fc34631419acc'
Interestingly, I can reproduce this error with the following environment:
python=3.9.4
pandas=1.2.4
dask=2021.5.0
distributed=2021.5.0
Specifically, the error originates from this step:
df2 = df[df.y > 0]
I raised an issue on GitHub, but in the meantime downgrading dask to 2021.4.0
works around the problem (the computed result is displayed):
python=3.9.4
pandas=1.2.4
dask=2021.4.1
distributed=2021.4.1
(Note that Python here is 3.9, which appears to match your environment.)
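Before rerunning the notebook, it may help to confirm which dask release is actually installed. The following is a minimal sketch, not part of dask: `parse_version`, `is_known_bad`, and `KNOWN_BAD` are hypothetical names, and the set lists only the one release reported as failing in this thread (2021.5.0). Whether other releases are affected is not established here.

```python
import re

# Hypothetical helper, not a dask API.
# Assumption: 2021.5.0 is the only release reported as failing in this thread.
KNOWN_BAD = {(2021, 5, 0)}


def parse_version(version):
    """Return (year, month, patch) from strings such as '2021.05.0' or '2021.5.0'.

    Returns None when the string does not start with three dotted integers.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_known_bad(version):
    """True if the version matches a release reported as failing above."""
    return parse_version(version) in KNOWN_BAD
```

In the notebook this could be called as `is_known_bad(dask.__version__)` right after `import dask`; if it returns True, installing one of the working versions listed above is the workaround described in this answer.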