dask dataframe: drop duplicate index values

Question

I am using dask dataframe with Python 2.7 and want to drop duplicated index values from my df.

With pandas I would use

df = df[~df.index.duplicated(keep = "first")]

and it works.

When I try the same thing on a dask dataframe, I get

AttributeError: 'Index' object has no attribute 'duplicated'

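I could reset the index and then use the column that was the index to drop duplicates, but I would like to avoid that if possible.

I could also call df.compute() and then drop the duplicated index values, but this df is too big to fit in memory.

How can I drop the duplicated index values from my dataframe while staying in dask?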

Answer 1

I think you need to convert the index to a Series with to_series. keep='first' can be omitted, because it is the default for that parameter in duplicated:

df = df[~df.index.to_series().duplicated()]
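
In case duplicated is also missing on the dask Series in your version, another option might be to deduplicate each partition with plain pandas. This is only a minimal sketch: drop_duplicate_index is a hypothetical helper name, and it assumes the index is sorted with known divisions, so that all rows sharing an index value sit in the same partition. Without that, duplicates that span partitions would survive.

```python
import pandas as pd


def drop_duplicate_index(partition):
    """Keep only the first row for each index value in one pandas partition."""
    return partition[~partition.index.duplicated(keep="first")]


# Small pandas frame standing in for a single dask partition.
example = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "a", "b", "c"])
deduplicated = drop_duplicate_index(example)

# Assumed dask usage (ddf is a dask DataFrame with sorted, known divisions):
# ddf = ddf.map_partitions(drop_duplicate_index)
```

map_partitions runs the helper lazily on each underlying pandas DataFrame, so nothing has to be loaded into memory with compute() up front.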
