Split Paragraph Document into Sentences

Question

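I have a database of paragraph documents. I want to split the paragraphs in the table "master_data" into individual sentences and store them in a different table, "splittext".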

master_data table :

id | Title | Paragraph

splittext table :

id_sen | sentences | doc_id

I tried this query to select each sentence in master_data.Paragraph:

SELECT Paragraph FROM pyproject.master_data  where REGEXP_SUBSTR '[^\.\!\* 
[\.\!\?]';

But it produces a parenthesis error. So I tried adding parentheses, and that produces the error Incorrect Parameter Count:

SELECT Paragraph FROM pyproject.master_data  where REGEXP_SUBSTR '([^\.\!\* 
[\.\!\?])';

My expected result is that each paragraph is split into sentences and stored in the new table, with the paragraph's original id stored in doc_id.

For example:

master_data :

id | Title | Paragraph  |
 1 | asds..| I want. Some. Coconut and Banana !! |
 2 | wad...| Milkshake? some Nice milk.          |

splittext_table :

id| sentences | doc_id  |

 1|   I want   |    1    |
 2|   Some     |    1    |
           .
           .
           . 
 5| Some Nice milk |   2   |

Answer 1

For MySQL 8.0, you can use a recursive CTE, given its limitations.

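with
  recursive r as (
      -- anchor: first sentence of every paragraph (occurrence 1)
      select
        1 id,
        cast(regexp_substr(
               Paragraph, '[^.!?]+(?:[.!?]+|$)'
             ) as char(256)) sentences,
        id doc_id, Title, Paragraph
      from master_data
    union all
      -- recursive step: fetch occurrence id + 1 of the same pattern
      select id + 1,
        regexp_substr(
          Paragraph, '[^.!?]+(?:[.!?]+|$)',
          1, id + 1
        ),
        doc_id, Title, Paragraph
      from r
      -- stop once a paragraph has no further match
      where sentences is not null
  )
select id, sentences, doc_id, Title
from r
-- keep one row even for paragraphs with no match at all
where sentences is not null or id = 1
order by doc_id, id;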

Output:

| id |       sentences       | doc_id | Title  |
+----+-----------------------+--------+--------+
|  1 | I want.               |      1 | asds.. |
|  2 | Some.                 |      1 | asds.. |
|  3 | Coconut and Banana !! |      1 | asds.. |
|  1 | Milkshake?            |      2 | wad... |
|  2 | some Nice milk.       |      2 | wad... |
|  1 | bar                   |      3 | foo    |

Demo on DB Fiddle.
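
The query above only selects the sentences. To store them in the second table, as the question asks, the same CTE can be placed inside an INSERT ... SELECT. This is a minimal sketch, not a tested solution: it assumes the target table is named splittext with the columns id_sen, sentences and doc_id from the question, and that id_sen is an AUTO_INCREMENT key (that part is my assumption, the question does not say so). It also trims surrounding whitespace and skips empty matches, which the original query does not do.

```sql
-- Assumption: splittext(id_sen AUTO_INCREMENT, sentences, doc_id) already exists.
insert into splittext (sentences, doc_id)
with
  recursive r as (
      -- anchor: first sentence of every paragraph
      select
        1 id,
        cast(regexp_substr(
               Paragraph, '[^.!?]+(?:[.!?]+|$)'
             ) as char(256)) sentences,
        id doc_id, Paragraph
      from master_data
    union all
      -- recursive step: next occurrence of the pattern
      select id + 1,
        regexp_substr(
          Paragraph, '[^.!?]+(?:[.!?]+|$)',
          1, id + 1
        ),
        doc_id, Paragraph
      from r
      where sentences is not null
  )
select trim(sentences), doc_id
from r
-- a NULL sentence fails this comparison too, so no-match rows are skipped
where trim(sentences) <> ''
order by doc_id, id;
```

Two of the limitations to keep in mind: the char(256) cast caps the sentence length (in strict SQL mode a longer sentence may raise an error rather than be truncated, so widen the cast if needed), and the recursion depth is bounded by cte_max_recursion_depth, which defaults to 1000.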

regex

mysql

regexp-substr