How to compare text and select similar sentences in SQLite?

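I'm using NLP to extract sentences containing certain keywords from SEC filings for different years. I store the output in SQLite via a pandas DataFrame. So far, so good. The problem comes when I want to compare sentences from two different years, say 2022 and 2021.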

I have been using the following query:

query = "select Nvidia_2022.Research as Research_2022, Nvidia_2021.Research as Research_2021 from Nvidia_2022 join Nvidia_2021 where '%' || Nvidia_2022.Research || '%' like '%' || Nvidia_2021.Research || '%'"

This mostly works for sentences that are exactly the same. Here is the output:

['Such license and development arrangements can further enhance the reach of our technology.'

'Such license and development arrangements can further enhance the reach of our technology.']

However, sometimes the sentences differ slightly, for example:

['We have invested over $29 billion in research and development since our inception, yielding inventions that are essential to modern computing.'

'We have invested over $24 billion in research and development since our inception, yielding inventions that are essential to modern computing.']

$29 billion vs. $24 billion

Or there are other differences at the end of the sentence:

'Our Compute & Networking segment includes Data Center platforms and systems for AI, HPC, and accelerated computing; Mellanox networking and interconnect solutions; automotive AI Cockpit, autonomous driving development agreements, and autonomous vehicle solutions; cryptocurrency mining processors, or CMP; Jetson for robotics and other embedded platforms; and NVIDIA AI Enterprise and other software.'

'Our Compute & Networking segment includes Data Center platforms and systems for AI, HPC, and accelerated computing; Mellanox networking and interconnect solutions; automotive AI Cockpit, autonomous driving development agreements, and autonomous vehicle solutions; and Jetson for robotics and other embedded platforms.'

My questions:

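Is there a way to do as much of the text comparison as possible in SQLite or another SQL database, and then pass only the most complicated sentences to Python for something like levenshtein_distance or transformer-based sentence comparison?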

Or should I stop using SQL comparison queries and do the heavy lifting in Python right away?

I'm trying to use SQL as much as possible because it tends to be much faster than computing distances in Python.

Some implementations, such as Snowflake, have an edit distance function: https://docs.snowflake.com/en/sql-reference/functions/editdistance.html
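As far as I know, SQLite's core does not ship an equivalent function, but Python's sqlite3 module lets you register your own. The following is a minimal sketch, not a tuned implementation: the SQL name `editdistance` and the helper `levenshtein` are my own choices for illustration, and every call goes through the Python interpreter, so don't expect native SQL speed.

```python
import sqlite3


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if a is None or b is None:
        return None  # propagate SQL NULL
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under a hypothetical name.
conn.create_function("editdistance", 2, levenshtein)

distance = conn.execute(
    "SELECT editdistance(?, ?)",
    ("invested over 29 billion", "invested over 24 billion"),
).fetchone()[0]
print(distance)
```

Once registered, the function can be used in a join condition such as `WHERE editdistance(a.Research, b.Research) < 20`, where the threshold is something you would have to pick for your own data.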

If you really want to do this in SQL, you could tokenize the sentences:

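  1. Split the varchar on spaces --> array
  2. unnest/flatten the array into a CTE
  3. Repeat steps 1 and 2 for the sentence you want to compare it with
  4. Join the 2 CTEs to see how many tokens they have in common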

But I don't think SQL is necessarily faster for this kind of operation, and it is not as robust as the Python libraries.
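The tokenization steps above could look roughly like this with SQLite. This is a sketch under a few assumptions: SQLite has no built-in split function, so the splitting is done in Python and the tokens are stored one per row instead of being unnested into a CTE; the `tokens` table, its columns, and the shortened sample sentences are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per (year, sentence, distinct token); hypothetical schema.
conn.execute("CREATE TABLE tokens (year INTEGER, sentence_id INTEGER, token TEXT)")


def add_sentences(year, sentences):
    """Steps 1-2: split each sentence on whitespace and flatten to rows."""
    rows = [
        (year, sentence_id, token)
        for sentence_id, sentence in enumerate(sentences)
        for token in set(sentence.lower().split())
    ]
    conn.executemany("INSERT INTO tokens VALUES (?, ?, ?)", rows)


add_sentences(2022, ["We have invested over 29 billion in research and development"])
# Step 3: repeat for the sentences you want to compare against.
add_sentences(2021, [
    "We have invested over 24 billion in research and development",
    "Such research arrangements can enhance our reach",
])

# Step 4: join the two token sets and count the tokens in common.
pairs = conn.execute("""
    SELECT a.sentence_id, b.sentence_id, COUNT(*) AS common
    FROM tokens AS a
    JOIN tokens AS b ON a.token = b.token
    WHERE a.year = 2022 AND b.year = 2021
    GROUP BY a.sentence_id, b.sentence_id
    ORDER BY common DESC
""").fetchall()
print(pairs)
```

The pairs with a high count are the candidates you might then hand to a proper distance function in Python.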

sqlite3 supports the FTS5 extension for full-text search.

You have to create a virtual table, and then you can use the MATCH keyword.

-- create a virtual table
CREATE VIRTUAL TABLE email USING fts5(sender, title, body);

-- populate it ...

-- perform a full text search
SELECT * FROM email WHERE email MATCH 'fts5' ORDER BY rank;
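Applied to the sentences from the question, this might look like the sketch below. It assumes the SQLite library bundled with your Python was compiled with FTS5 (common, but not guaranteed, hence the try/except); the table name `research_2021`, the column `sentence`, and the idea of OR-ing together the words of the 2022 sentence are my own illustration, not something prescribed by FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
try:
    # Hypothetical table holding the 2021 sentences.
    conn.execute("CREATE VIRTUAL TABLE research_2021 USING fts5(sentence)")
    fts5_available = True
except sqlite3.OperationalError:
    fts5_available = False  # this SQLite build lacks FTS5

hits = []
if fts5_available:
    conn.executemany(
        "INSERT INTO research_2021(sentence) VALUES (?)",
        [
            ("We have invested over 24 billion in research and development "
             "since our inception.",),
            ("Such license and development arrangements can further enhance "
             "the reach of our technology.",),
        ],
    )
    # Build a query that matches any word of the 2022 sentence.
    # Each word is double-quoted so FTS5 treats it as a plain string.
    words = "We have invested over 29 billion in research".split()
    match_expr = " OR ".join('"' + w.replace('"', '""') + '"' for w in words)
    hits = conn.execute(
        "SELECT sentence FROM research_2021 "
        "WHERE research_2021 MATCH ? ORDER BY rank",
        (match_expr,),
    ).fetchall()

for (sentence,) in hits:
    print(sentence)
```

This only narrows down candidates by shared words; it does not give you an edit distance, so the near-duplicates would still need a closer comparison afterwards.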