Fuzzymatcher returns NaN for best_match_score

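I'm seeing strange behavior when running fuzzy_left_join from the fuzzymatcher library. I'm trying to join two DataFrames: the left one has 5217 records and the right one has 8734. Only 71 records end up with a best_match_score, which seems far too few. To get better results, I even stripped all digits and kept only alphabetic characters in the join column. In the merged table, the id column coming from the right table is NaN, which is also unexpected.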
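For reference, the cleanup of the join column can be sketched roughly as below. This is a minimal illustration, not the exact code that was run: the helper name `normalize_name` is hypothetical, and it assumes the join key is derived from the picture's file name by dropping the extension, lowercasing, and keeping only the letters a-z, which is consistent with the sample rows in the tables.

```python
import os
import re

def normalize_name(picture_url: str) -> str:
    """Reduce a picture URL or file name to lowercase ASCII letters only."""
    # Keep only the last path component, e.g. "Limoni.jpg".
    filename = picture_url.rsplit("/", 1)[-1]
    # Drop the file extension, if there is one.
    stem, _ext = os.path.splitext(filename)
    # Lowercase, then remove everything that is not a-z (digits, punctuation).
    return re.sub(r"[^a-z]", "", stem.lower())
```

With this sketch, `Limoni500g09700112` becomes `limonig` and `Limoni.jpg` becomes `limoni`, so the two keys that should match differ by a single trailing letter.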

Left table - the join column is "amazon_s3_name". First item - limonig

+------+---------+-------+-----------+------------------------------------+
|  id  | product | price | category  |           amazon_s3_name           |
+------+---------+-------+-----------+------------------------------------+
|    1 | A       |  1.49 | fruits    | limonig                            |
| 8964 | B       |  1.39 | beverages | studencajfuzelimonilimonetatrevaml |
| 9659 | C       |  2.79 | beverages | studencajfuzelimonilimtreval       |
+------+---------+-------+-----------+------------------------------------+

Right table - the join column is "amazon_s3_name". Last item - limoni

+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
|  id  |                                                       picture                                                              |                    amazon_s3_name          |
+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
|  191 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/AhmadCajLimonIDjindjifil20X2G.jpg                          | ahmadcajlimonidjindjifilxg                 |
|  192 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/AhmadCajLimonIDjindjifil20X2G40g.jpg                       | ahmadcajlimonidjindjifilxgg                |
|  204 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Ahmadcajlimonidjindjifil20x2g40g00051265.jpg               | ahmadcajlimonidjindjifilxgg                |
| 1608 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Cajstudenfuzetealimonilimonovatreva15lpet.jpg              | cajstudenfuzetealimonilimonovatrevalpet    |
| 4689 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Lesieursalatensosslimonimaslinovomaslo.jpg                 | lesieursalatensosslimonimaslinovomaslo     |
| 4690 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Lesieursalatensosslimonimaslinovomaslo05l500ml01301150.jpg | lesieursalatensosslimonimaslinovomaslolml  |
| 4723 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Limoni.jpg                                                 | limoni                                     |
+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+

Merged table - as we can see, after the merge best_match_score is NaN

+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
| id | best_match_score | __id_left | __id_right | price | category | amazon_s3_name_left  | image_left | amazon_s3_name_left | image_right | amazon_s3_name_right |
+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
|  0 | NaN              | 0_left    | None       |  1.49 | Fruits   | Limoni500g09700112   | NaN        | limonig             | NaN         | NaN                  |
|  2 | NaN              | 2_left    | None       |  1.69 | Bio      | Morkovi1kgbr09700132 | NaN        | morkovikgbr         | NaN         | NaN                  |
+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+

You could give PolyFuzz a try. Set it up as in its examples, for instance with TF-IDF or BERT, then run:

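from polyfuzz import PolyFuzz
matchers = "TF-IDF"  # or a list of matcher objects, set up as in the PolyFuzz examples
model = PolyFuzz(matchers).match(df1["amazon_s3_name"].tolist(), df2["amazon_s3_name"].tolist())
df1["To"] = model.get_matches()["To"]  # best right-hand name for each left row (assumes df1 has a default 0..n-1 index)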

Then merge:

df1.merge(df2, left_on='To', right_on='amazon_s3_name')
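
If PolyFuzz is not available, the same match-then-join idea can be illustrated with the standard library's `difflib.get_close_matches`. Note that this swaps in a different technique (character-sequence similarity rather than TF-IDF or BERT). The helper name `best_matches`, the 0.6 cutoff, and the inline sample lists are assumptions chosen for illustration, using names from the question's tables.

```python
from difflib import get_close_matches

def best_matches(left_names, right_names, cutoff=0.6):
    """Map each left name to its closest right name, or None if nothing clears the cutoff."""
    result = {}
    for name in left_names:
        # n=1 keeps only the single best candidate at or above the cutoff.
        candidates = get_close_matches(name, right_names, n=1, cutoff=cutoff)
        result[name] = candidates[0] if candidates else None
    return result

left = ["limonig", "studencajfuzelimonilimonetatrevaml"]
right = [
    "ahmadcajlimonidjindjifilxg",
    "cajstudenfuzetealimonilimonovatrevalpet",
    "limoni",
]
matches = best_matches(left, right)
print(matches["limonig"])  # limoni
```

The resulting mapping plays the same role as the `To` column: names with no sufficiently close candidate map to None, so they can be kept as unmatched rows instead of being dropped.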