Regex in Python to detect ellipsis

Question

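I have a large text corpus that I want to preprocess and then use to train a Word2Vec model. In some cases part of a word is omitted because of ellipsis (a suspended hyphen), as in: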

But seeing them playing to seven- and eight-year-olds is beautiful

or

The country was in the uproar of pre- and then post-independence civil war but the mood here is most often joyous

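Now I want to undo these omissions (year-olds and independence, respectively). This is what I wrote:

re.sub(r'- (and|to|or)( [^ -]+?){1,2}-(.+?)( |$|\n)', r'-\3 \1\2-\3\4', text)

But this does not work: if there is more than one word between and/or/to and the second hyphenated word, only one of them appears in the output. The output I want is: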

But seeing them playing to seven-year-olds and eight-year-olds is beautiful

and

The country was in the uproar of pre-independence and then post-independence civil war but the mood here is most often joyous

Answer 1

I found a solution:

re.sub(r'- (and|to|or)((?: [^ -]+?){1,2})-(.+?)( |$|\n)', r'-\3 \1\2-\3\4', text)
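
The difference from the pattern in the question is where the capturing parentheses sit. A quantified capturing group keeps only its last repetition, so the group in the question ends up holding a single word, while a non-capturing repetition wrapped in one outer group keeps the whole run. A minimal sketch of that behaviour (the sample string and variable names are only for illustration):

```python
import re

sample = "pre- and then post-independence civil war"

# Quantified capturing group: only the last repetition survives in group 2.
repeated = re.search(r"- (and|to|or)( [^ -]+?){1,2}-", sample)
print(repr(repeated.group(2)))

# Non-capturing repetition wrapped in one capturing group: the whole run is kept.
wrapped = re.search(r"- (and|to|or)((?: [^ -]+?){1,2})-", sample)
print(repr(wrapped.group(2)))
```

With the first pattern the word "then" is lost from the replacement; with the second it is preserved.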

Answer 2

You can use

re.sub(r'\b-(\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*))', r'\2\1', text)

See the regex demo. Details:

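\b- - a hyphen preceded by a word character
(\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*)) - Group 1:
- \s+ - one or more whitespace characters
- (?:and|to|or) - and, to or or
- (?:\s+\w+)* - zero or more occurrences of one or more whitespace characters followed by one or more word characters
- \s+ - one or more whitespace characters
- \w+ - one or more word characters
- (-\w[\w-]*) - Group 2: a hyphen, a word character, then zero or more word characters or hyphens.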
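
To see what the two groups described above actually capture, here is a small sketch run on the first sentence from the question. The replacement \2\1 is my reading of how the groups are meant to be combined (shared tail first, then the rest of the match); the variable names are illustrative.

```python
import re

# Pattern from the answer; the sample is the first example sentence from the question.
pattern = re.compile(r"\b-(\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*))")
sample = "But seeing them playing to seven- and eight-year-olds is beautiful"

match = pattern.search(sample)
print(repr(match.group(1)))  # everything after the suspended hyphen
print(repr(match.group(2)))  # the shared tail, starting with its hyphen

# Writing group 2 before group 1 puts the shared tail right after the first word.
restored = pattern.sub(r"\2\1", sample)
print(restored)
```

The suspended hyphen itself is matched outside Group 1, so it is dropped and replaced by the hyphen that starts Group 2.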

See the Python demo:

import re
texts = ['But seeing them playing to seven- and eight-year-olds is beautiful', 'The country was in the uproar of pre- and then post-independence civil war but the mood here is most often joyous']
# Group 1 is the text after the suspended hyphen; Group 2 is the shared tail, e.g. -year-olds
rx = r'\b-(\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*))'
for text in texts:
    # Put the shared tail right after the first word, then the rest of the match
    print(re.sub(rx, r'\2\1', text))

Output:

But seeing them playing to seven-year-olds and eight-year-olds is beautiful
The country was in the uproar of pre-independence and then post-independence civil war but the mood here is most often joyous
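
For a whole corpus it may be convenient to compile the pattern once and wrap the substitution in a function. The sketch below assumes the replacement \2\1 and uses a hypothetical helper name, expand_suspended_hyphens; the whitespace split is only a stand-in for whatever tokenisation is done before training Word2Vec.

```python
import re

# Hypothetical helper; the pattern is the one explained in this answer.
SUSPENDED_HYPHEN = re.compile(r"\b-(\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*))")

def expand_suspended_hyphens(text: str) -> str:
    """Copy the shared tail (group 2) onto the word that ends in a bare hyphen."""
    return SUSPENDED_HYPHEN.sub(r"\2\1", text)

corpus = [
    "But seeing them playing to seven- and eight-year-olds is beautiful",
    "A well-known and respected author",  # no suspended hyphen, left unchanged
]

# Whitespace tokenisation is an assumption, not part of the original answer.
tokenized = [expand_suspended_hyphens(line).split() for line in corpus]
print(tokenized[0])
print(tokenized[1])
```

Note that the number of words allowed between and/to/or and the hyphenated word is unbounded here, so a stray trailing hyphen followed much later by any hyphenated word would also match. Bounding the repetition, for example with {0,2} instead of *, is one way to limit that.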

Tags: python, regex, text-processing, nlp