How to merge two dask dataframes with string indexes?

I'm trying to read SQL tables and merge them quickly. This is with dask version 2.8.0. Here is my code snippet:

import dask.dataframe as dd

tdf = dd.read_sql_table('comments', conn_url, index_col='author', divisions=list('1234567890'))
adf = dd.read_sql_table('users', conn_url, index_col='id', divisions=list('1234567890'))
dd.merge(tdf, adf, how='left', left_index=True, right_index=True)

The dtype of both indexes is 'O'. But I get an error:

...
...
~/continual/venv/lib/python3.8/site-packages/dask/dataframe/core.py in repartition(self, divisions, npartitions, partition_size, freq, force)
   1120             return repartition_npartitions(self, npartitions)
   1121         elif divisions is not None:
-> 1122             return repartition(self, divisions, force=force)
   1123         elif freq is not None:
   1124             return repartition_freq(self, freq=freq)

~/continual/venv/lib/python3.8/site-packages/dask/dataframe/core.py in repartition(df, divisions, force)
   5656         tmp = "repartition-split-" + token
   5657         out = "repartition-merge-" + token
-> 5658         dsk = repartition_divisions(
   5659             df.divisions, divisions, df._name, tmp, out, force=force
   5660         )

~/continual/venv/lib/python3.8/site-packages/dask/dataframe/core.py in repartition_divisions(a, b, name, out1, out2, force)
   5314      ('c', 2): ('b', 3)}
   5315     """
-> 5316     check_divisions(b)
   5317 
   5318     if len(b) < 2:

~/continual/venv/lib/python3.8/site-packages/dask/dataframe/core.py in check_divisions(divisions)
   5276     divisions = list(divisions)
   5277     if divisions != sorted(divisions):
-> 5278         raise ValueError("New division must be sorted")
   5279     if len(divisions[:-1]) != len(list(unique(divisions[:-1]))):
   5280         msg = "New division must be unique, except for the last element"

ValueError: New division must be sorted

How can I make this join work?

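The divisions list is indeed not sorted. Recall that your index is a string, so '0' as a string sorts before '1', and list('1234567890') puts '0' last: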

# check order: '0' sorts first, not last
sorted(list("1234567890"))
# ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
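
As a minimal sketch of the fix, the snippet below reproduces the comparison that check_divisions makes in the traceback (divisions != sorted(divisions)) and builds a sorted list that would pass it. The name fixed_divisions is only illustrative, and the read_sql_table calls are left as comments because they need the conn_url and tables from the question.

```python
# Divisions as passed in the question: '0' is the last element.
divisions = list("1234567890")

# Same comparison check_divisions performs in the traceback above.
print(divisions == sorted(divisions))  # False, hence "New division must be sorted"

# Sorting the boundaries as strings puts '0' first.
fixed_divisions = sorted(divisions)
print(fixed_divisions == sorted(fixed_divisions))  # True

# Hypothetical usage mirroring the question's calls (not run here,
# since conn_url and the tables are not available in this sketch):
# tdf = dd.read_sql_table('comments', conn_url, index_col='author',
#                         divisions=fixed_divisions)
# adf = dd.read_sql_table('users', conn_url, index_col='id',
#                         divisions=fixed_divisions)
# dd.merge(tdf, adf, how='left', left_index=True, right_index=True)
```

Passing the same sorted list to both read_sql_table calls should give the two frames identical divisions, so the merge would not have to repartition onto an unsorted list.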