How to match text efficiently between two DataFrames

Question

I have some text data: data1

id	comment	title
user_A	good	a file name
user_B	a better way is…	is there some good sugg？
user_C	a another way is…	is there some good sugg？
user_C	I have been using Pandas for a long time, so I…	a book

You can copy the table above and load it with

pd.read_clipboard()

data2

userid	title
user_X	is there some good sugg？
user_Y	a great idea…
user_Z	a file name
user_W	a book

Expected output

uid	comment	title	uid
user_A	good	a file name	user_Z
user_B	a better way is…	is there some good sugg？	user_X
user_C	a another way is…	is there some good sugg？	user_X
user_C	I have been using Pandas for a long time, so I…	a book	user_W

A simple approach is to merge on title in pandas:

dataall = pd.merge(
    data1,data2,
    on = 'title',
    how ='left'
)

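But this is expensive in memory. data1 has shape (2942087, 7) (and can sometimes have more than 3 times as many rows), and data2 has shape (47516640, 4). My machine has 32 GB of RAM, which is not enough. I also tried the same join in polars: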

dataall = data1.join(
    data2,
    on = 'title',
    how ='left'
)

An error occurred:


Canceled future for execute_request message before replies were done

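I have tried the is_in function in polars and encoding the text as numbers. Both are fast, but I don't know how to build the full match out of them.
Is there an efficient and feasible way to do this in pandas/polars/numpy?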

Following @ritchie46's suggestion:
----- Edit 2022-5-24 16:00:10

import polars as pl
pl.Config.set_global_string_cache()

data1 = pl.read_parquet('data1.parquet.gzip').lazy()
data2 = pl.read_parquet('data2.parquet.gzip').lazy()

data1 = data1.with_column(pl.col('source_post_title').cast(pl.Categorical))
data2 = data2.with_column(pl.col('source_post_title').cast(pl.Categorical))


dataall = data1.join(
    data2,
    on = 'source_post_title',
    how ='left'
).collect()

The code seems to run for a while and then fails with:

Canceled future for execute_request message before replies were done
The Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click here for more info. View Jupyter log for further details.

Is my processor simply too weak? My CPU is an i7-10850H.

Answer 1

If your join key contains many duplicates, the output table can be much larger than either of the tables you are joining.
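To see how quickly duplicates inflate the result, here is a rough sketch that counts the rows a left join would produce without performing the join. The helper name left_join_row_count and the shortened sample titles are made up for illustration. It only needs the key columns, so it is much cheaper than the join itself.

```python
from collections import Counter

def left_join_row_count(left_keys, right_keys):
    """Number of rows a left join on these keys would produce."""
    right_counts = Counter(right_keys)
    # Each left row yields one output row per matching right row,
    # or a single row of nulls when there is no match.
    return sum(max(right_counts[key], 1) for key in left_keys)

# Keys shaped like the sample data in the question: no duplicates on the right.
left_titles = ["a file name", "some sugg", "some sugg", "a book"]
right_titles = ["some sugg", "a great idea", "a file name", "a book"]
print(left_join_row_count(left_titles, right_titles))  # 4

# A single title repeated on both sides multiplies: 3 * 1000 rows.
print(left_join_row_count(["x"] * 3, ["x"] * 1000))  # 3000
```

If the estimate for the real data is far above what fits in RAM, the join cannot succeed as written, whatever library runs it.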

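Things that may help in polars:

Use the Categorical data type, so that duplicated strings are cached.
Remove duplicate join keys so that the output table does not explode (if correctness allows it).
Use the polars lazy API directly from the scan level. That way intermediate results are cleared and do not stay in RAM. In addition, polars may apply other optimizations that reduce memory pressure.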

If you don't need all of the output data, say only the first x million rows of the join result, you can use the polars lazy API.

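import polars as pl

lf_a = pl.scan_parquet("data1")
# ... some more work on lf_a

lf_b = pl.scan_parquet("data2")
# ... some more work on lf_b

# take only the first million rows
N = int(1e6)

# because of the head operation the join will not materialize a full output table
# (the join key "title" and how="left" are taken from the question)
result = lf_a.join(lf_b, on="title", how="left").head(N).collect()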
