How to select data with a list of indexes from a partitioned DF (non-unique indexes)?

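Problem

I have a dataframe df whose index is not monotonically increasing across its 4 partitions, meaning that every partition is indexed with [0..N]. I need to select rows based on a list of indexes [0..M], where M > N. Using loc gives inconsistent output, because multiple rows are indexed by 0 (see example).

In other words, I need to work around the difference between Dask's and Pandas' reset_index, since a Pandas-style reset_index would easily solve my problem.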

Example

print df.loc[0].compute() gives:

   Unnamed: 0  best_answer  thread_id  ty_avc    ty_ber  ty_cjr  ty_cpc  \
0           0            1          1       1  0.052174       9      18   
0           0            1       5284      12  0.039663      34      60   
0           0            1      18132       2  0.042254       7      20   
0           0            1      44211       4  0.025000       5       5

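Possible solutions

1. Repartition df into 1 single partition and reset_index; I don't like this because it won't fit in memory;
2. Add a column with [0..M] indexes and use set_index, which is discouraged in the performance tips;
3. This solution solves a different problem, because his df has unique indexes;
4. Split the list of indexes into npartitions parts, apply an offset calculation and use map_partitions.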

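I cannot think of any other solution... The last one is probably the most efficient, but I am not sure whether it is actually feasible.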
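The offset calculation in the last option can be sketched without Dask. This is a minimal sketch, assuming the per-partition lengths are already known and that "index" means the global row position; the function name locate and the lengths used here are hypothetical.

```python
import numpy as np

def locate(global_idx, partition_lengths):
    """Map global row positions to (partition number, local index) pairs."""
    lengths = np.asarray(partition_lengths)
    # offsets[i] is the global position of the first row of partition i
    offsets = np.concatenate(([0], np.cumsum(lengths)[:-1]))
    global_idx = np.asarray(global_idx)
    # index of the last offset that is <= each global position
    part = np.searchsorted(offsets, global_idx, side="right") - 1
    local = global_idx - offsets[part]
    return part, local

# Hypothetical lengths for the 4 partitions; positions past the total
# row count are not checked here.
part, local = locate([0, 3, 7, 11], [3, 3, 3, 3])
```

Each (partition, local index) pair can then be looked up inside the matching partition, which is the part map_partitions would have to do.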

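Generally, Dask.dataframe does not track the lengths of the pandas dataframes that make up the dask.dataframe. I suspect that your option 4 is best. You might also consider using dask.delayed.

See also http://dask.pydata.org/en/latest/delayed-collections.html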