How to merge two dask dataframes with string indexes?
I am trying to read SQL tables and perform a merge quickly. This is with dask version 2.8.0. Here is my code snippet:
tdf = dd.read_sql_table('comments', conn_url, index_col='author', divisions=list('1234567890'))
adf = dd.read_sql_table('users', conn_url, index_col='id', divisions=list('1234567890'))
dd.merge(tdf, adf, how='left', left_index=True, right_index=True)
The dtype of the index is 'O'. But I get an error:
...
...
~/continual/venv/lib/python3.8/site-packages/dask/dataframe/core.py in repartition(self, divisions, npartitions, partition_size, freq, force)
1120 return repartition_npartitions(self, npartitions)
1121 elif divisions is not None:
-> 1122 return repartition(self, divisions, force=force)
1123 elif freq is not None:
1124 return repartition_freq(self, freq=freq)
~/continual/venv/lib/python3.8/site-packages/dask/dataframe/core.py in repartition(df, divisions, force)
5656 tmp = "repartition-split-" + token
5657 out = "repartition-merge-" + token
-> 5658 dsk = repartition_divisions(
5659 df.divisions, divisions, df._name, tmp, out, force=force
5660 )
~/continual/venv/lib/python3.8/site-packages/dask/dataframe/core.py in repartition_divisions(a, b, name, out1, out2, force)
5314 ('c', 2): ('b', 3)}
5315 """
-> 5316 check_divisions(b)
5317
5318 if len(b) < 2:
~/continual/venv/lib/python3.8/site-packages/dask/dataframe/core.py in check_divisions(divisions)
5276 divisions = list(divisions)
5277 if divisions != sorted(divisions):
-> 5278 raise ValueError("New division must be sorted")
5279 if len(divisions[:-1]) != len(list(unique(divisions[:-1]))):
5280 msg = "New division must be unique, except for the last element"
ValueError: New division must be sorted
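How can I achieve this join?
The divisions list is indeed not sorted. Remember that your index is a string, and as a string '0' sorts before '1', so '0' has to come first in the list: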
# check order
sorted(list("1234567890"))  # the result starts with '0', not '1'
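To make this concrete, here is a minimal sketch of how the divisions could be sorted before they are passed to dask. The helper `is_valid_divisions` is hypothetical and not part of dask; it only mirrors the two checks visible in the traceback (sorted, and unique except for the last element). The dask calls are left as comments because they need a live database and the `conn_url` from the question. It also assumes every index value falls between '0' and '9' as strings; as far as I know divisions act as partition boundaries, so values that sort after the last division may not be read.

```python
# Sketch: build division boundaries that are sorted as strings.
raw_divisions = list("1234567890")  # the order used in the question
divisions = sorted(raw_divisions)   # as strings, '0' sorts before '1'


def is_valid_divisions(divs):
    """Hypothetical helper mirroring the checks shown in the traceback."""
    divs = list(divs)
    is_sorted = divs == sorted(divs)
    # every element except the last one must be unique
    is_unique = len(divs[:-1]) == len(set(divs[:-1]))
    return is_sorted and is_unique


print(is_valid_divisions(raw_divisions))  # False
print(is_valid_divisions(divisions))      # True

# With the sorted list, the reads from the question would become
# (not run here: they need a database and `conn_url`):
# tdf = dd.read_sql_table('comments', conn_url, index_col='author', divisions=divisions)
# adf = dd.read_sql_table('users', conn_url, index_col='id', divisions=divisions)
# dd.merge(tdf, adf, how='left', left_index=True, right_index=True)
```

Both tables get the same divisions here, which is what the question already does, so the index-on-index merge should not need to repartition to a new, unsorted set of boundaries.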