Split Paragraph Document into Sentences
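I have a database of paragraph documents. I want to split each paragraph in the table "master_data" into sentences and store them in a different table, "splittext".
master_data table:
id | Title | Paragraph
splittext table:
id_sen | sentences | doc_id
I tried using this query to select each sentence in master_data.Paragraph: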
SELECT Paragraph FROM pyproject.master_data where REGEXP_SUBSTR '[^\.\!\*
[\.\!\?]';
But it produces a parenthesis error. So I tried adding parentheses, which produced the error "Incorrect Parameter Count":
SELECT Paragraph FROM pyproject.master_data where REGEXP_SUBSTR '([^\.\!\*
[\.\!\?])';
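My expected result is that each paragraph is split into sentences and stored in the new table, with the original id of the paragraph stored in doc_id.
For example: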
master_data :
id | Title | Paragraph |
1 | asds..| I want. Some. Coconut and Banana !! |
2 | wad...| Milkshake? some Nice milk. |
splittext_table :
id| sentences | doc_id |
1| I want | 1 |
2| Some | 1 |
.
.
.
5| Some Nice milk | 2 |
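For MySQL 8.0, you can use a recursive CTE, bearing its limitations in mind (for example, the sentences column has to be cast to a fixed type such as char(256) in the first select, which caps the length of each extracted sentence).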
with recursive r as (
    -- anchor: first sentence of each paragraph
    select
        1 id,
        cast(regexp_substr(
            Paragraph, '[^.!?]+(?:[.!?]+|$)'
        ) as char(256)) sentences,
        id doc_id, Title, Paragraph
    from master_data
    union all
    -- recursive step: fetch the (id + 1)-th match until none is left
    select id + 1,
        regexp_substr(
            Paragraph, '[^.!?]+(?:[.!?]+|$)',
            1, id + 1
        ),
        doc_id, Title, Paragraph
    from r
    where sentences is not null
)
select id, sentences, doc_id, Title
from r
where sentences is not null or id = 1
order by doc_id, id;
Output:
| id | sentences | doc_id | Title |
+----+-----------------------+--------+--------+
| 1 | I want. | 1 | asds.. |
| 2 | Some. | 1 | asds.. |
| 3 | Coconut and Banana !! | 1 | asds.. |
| 1 | Milkshake? | 2 | wad... |
| 2 | some Nice milk. | 2 | wad... |
| 1 | bar | 3 | foo |
Demo on DB Fiddle.
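The query above only selects the sentences. To store them in splittext, as the question asks, the same CTE can probably be placed inside an INSERT ... SELECT. The following is only a sketch under assumptions that are not in the question: splittext.id_sen is taken to be an AUTO_INCREMENT key, sentences a text column, and the counter column is renamed to n (a made-up name) so it is not confused with master_data.id. TRIM is added because each match after the first starts with the space that follows the previous sentence.

```sql
-- Assumed (hypothetical) target table:
--   splittext(id_sen INT AUTO_INCREMENT PRIMARY KEY, sentences TEXT, doc_id INT)
INSERT INTO splittext (sentences, doc_id)
WITH RECURSIVE r AS (
    -- anchor: first sentence of each paragraph; the CAST fixes the column
    -- type for the recursive part, so longer sentences will not fit
    SELECT 1 AS n,
           CAST(REGEXP_SUBSTR(Paragraph, '[^.!?]+(?:[.!?]+|$)') AS CHAR(256)) AS sentences,
           id AS doc_id,
           Paragraph
    FROM master_data
    UNION ALL
    -- recursive step: ask for the (n + 1)-th match of the same pattern
    SELECT n + 1,
           REGEXP_SUBSTR(Paragraph, '[^.!?]+(?:[.!?]+|$)', 1, n + 1),
           doc_id,
           Paragraph
    FROM r
    WHERE sentences IS NOT NULL
)
SELECT TRIM(sentences), doc_id
FROM r
WHERE sentences IS NOT NULL   -- skip the terminating NULL row of each paragraph
ORDER BY doc_id, n;
```

Unlike the select above, this version drops paragraphs that yield no match at all instead of keeping a NULL row for them. The sentences keep their trailing punctuation; if it should be removed as in the expected result, something like REGEXP_REPLACE(sentences, '[.!?]+$', '') could be applied before TRIM.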