dask index not behaving like a column (and not like in pandas)

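While working on this bug report: https://github.com/dask/dask/issues/8319 I ran into a problem with the workaround shown below. Since it seems to be outside the scope of that bug report, I am asking the original question here: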

import pandas as pd
import dask.dataframe  # a bare "import dask" does not load the dataframe submodule

# some example dataframe
df = pd.DataFrame([{"a": "A", "b": "B"}, {"a": "@", "b": "β"}, {"a": "Aa", "b": "Bb"}, {"a": "aa", "b": "bb"}])

# pandas version
df2 = df.set_index("a")
df2[df2.index.str.endswith("a")]
# this works: pandas accepts a boolean "array" of the right length, whether or not it carries a matching index

# dask version
ddf = dask.dataframe.from_pandas(df, npartitions=2)
ddf2 = ddf.set_index("a")

# this works with a regular column
ddf2[ddf2.b.str.endswith("b")].compute()
# selects the rows where column b ends with "b"

# indices don't behave like columns
ddf2[ddf2.index.str.endswith("a")].compute()
# TypeError: '<' not supported between instances of 'bool' and 'str'

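I'm not sure whether this is a bug in dask or something that is simply not possible in dask, because once you use more than one partition you no longer know how to map the index onto the partitions. (Except that this works fine inside map_partitions, since there you are only dealing with pandas DataFrames.)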

Is there something I'm missing, or is this so deeply rooted that it can't easily be fixed?

Related: "BUG: Dask dataframe cannot handle string index" #3269 and "Better handling for arrays/series of keys in dask.dataframe.loc" #8254 (both open).

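I think the current workaround is to create a boolean Series and compute it before using it to index into the DataFrame. This raises a warning, but it seems to do the trick for this example: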

In [19]: ddf2[ddf2.index.to_series().str.endswith('a').compute()].compute()
/.../lib/python3.9/site-packages/dask/dataframe/core.py:3703: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  meta = self._meta[_extract_meta(key)]
/.../lib/python3.9/site-packages/dask/core.py:121: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  return func(*(_execute_task(a, cache) for a in args))
Out[19]:
     b
a
Aa  Bb
aa  bb
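
The warning itself can be reproduced with pandas alone. This is a rough sketch, assuming (as the paths in the warning output suggest) that dask applies the already-computed, full-length boolean Series to each partition in turn. The `part` slice below is only a stand-in for one partition, not something dask exposes under that name.

```python
import warnings

import pandas as pd

df2 = pd.DataFrame(
    {"a": ["A", "@", "Aa", "aa"], "b": ["B", "β", "Bb", "bb"]}
).set_index("a")

# full-length boolean Series, indexed by "a" (what the inner .compute() returns)
mask = df2.index.to_series().str.endswith("a")

# stand-in for a single partition: it holds only part of the index
part = df2.iloc[2:]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # the key's index differs from part's index, so pandas realigns it and warns
    filtered = part[mask]

with warnings.catch_warnings(record=True) as caught_aligned:
    warnings.simplefilter("always")
    # restricting the key to the partition's own index first avoids the warning
    filtered_aligned = part[mask.loc[part.index]]

print(filtered)
print([str(w.message) for w in caught])
```

The selected rows are the same in both cases; the warning only signals that pandas had to realign the boolean key to the smaller frame.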