dask index not behaving like a column (and not like in pandas)
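In this bug report: https://github.com/dask/dask/issues/8319 I ran into a problem with the workaround shown below. Since it seems to be out of scope for that bug report, I'm asking the original question here: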
import pandas as pd
import dask.dataframe  # a bare "import dask" does not guarantee the dataframe submodule is loaded
# some example dataframe
df = pd.DataFrame([{"a": "A", "b": "B"}, {"a": "@", "b": "β"}, {"a": "Aa", "b": "Bb"}, {"a": "aa", "b": "bb"}])
# pandas version
df2 = df.set_index("a")
df2[df2.index.str.endswith("a")]
# this works, as pandas allows an "array" of the right length regardless of having the same index
# dask version
ddf = dask.dataframe.from_pandas(df, npartitions=2)
ddf2 = ddf.set_index("a")
# this works with a regular column
ddf2[ddf2.b.str.endswith("b")].compute()
# selects the rows where column b ends with "b"
# indices don't behave like columns
ddf2[ddf2.index.str.endswith("a")].compute()
# TypeError: '<' not supported between instances of 'bool' and 'str'
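I'm not sure whether this is a bug in dask, or something that is simply not possible in dask because, once you use multiple partitions, you no longer know how to map the index onto the partitions. (Except that this works fine inside map_partitions, since there you are only dealing with pandas DataFrames.)
Is there something I'm missing, or is this so deeply rooted that it can't easily be fixed?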
Related: BUG: Dask dataframe cannot handle string index #3269 and Better handling for arrays/series of keys in dask.dataframe.loc #8254 (both open).
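I think the current workaround is to create a boolean Series and compute it before using it to index into the DataFrame. This raises a warning, but it seems to do the trick in this example: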
In [19]: ddf2[ddf2.index.to_series().str.endswith('a').compute()].compute()
/.../lib/python3.9/site-packages/dask/dataframe/core.py:3703: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
meta = self._meta[_extract_meta(key)]
/.../lib/python3.9/site-packages/dask/core.py:121: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
return func(*(_execute_task(a, cache) for a in args))
Out[19]:
     b
a
Aa  Bb
aa  bb