Regex in Python to detect ellipsis

I have a large text corpus that I want to preprocess and then use to train a Word2Vec model. In some cases part of a word is dropped because of ellipsis, as in:

But seeing them playing to seven- and eight-year-olds is beautiful

The country was in the uproar of pre- and then post-independence civil war but the mood here is most often joyous

Now I want to undo these deletions (restoring seven-year-olds and pre-independence, respectively). This is what I wrote:

re.sub(r'- (and|to|or)( [^ -]+?){1,2}-(.+?)( |$|\n)', r'-\3 \1\2-\3\4', text)

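But this does not work: if there is more than one word between and/or/to and the second hyphenated word, only one of them is kept (a repeated capturing group retains just its last repetition). The output I want is: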

But seeing them playing to seven-year-olds and eight-year-olds is beautiful

The country was in the uproar of pre-independence and then post-independence civil war but the mood here is most often joyous

I found a solution:

re.sub(r'- (and|to|or)((?: [^ -]+?){1,2})-(.+?)( |$|\n)', r'-\3 \1\2-\3\4', text)
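The difference between the two patterns comes down to how Python's re module handles a quantified capturing group: only the last repetition is stored. Below is a minimal sketch of that behaviour on a shortened version of the second sample sentence; the variable names are illustrative and not part of the original post.

```python
import re

text = 'pre- and then post-independence civil war'

# Quantified capturing group: group 2 keeps only its last repetition.
first_try = re.search(r'- (and|to|or)( [^ -]+?){1,2}-(.+?)( |$|\n)', text)

# Capturing group wrapped around a quantified non-capturing group:
# group 2 now spans every repetition.
fixed = re.search(r'- (and|to|or)((?: [^ -]+?){1,2})-(.+?)( |$|\n)', text)

print(repr(first_try.group(2)))  # ' post'
print(repr(fixed.group(2)))      # ' then post'
```

With the first pattern the word "then" never reaches the replacement, which is why it goes missing from the output.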

You can use

re.sub(r'\b-(\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*))', r'\2\1', text)

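Regex details:

  • \b- - a hyphen that is preceded by a word char
  • (\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*)) - Group 1:
    • \s+ - one or more whitespace chars
    • (?:and|to|or) - and, to, or or
    • (?:\s+\w+)* - zero or more occurrences of one or more whitespace chars followed by one or more word chars
    • \s+ - one or more whitespace chars
    • \w+ - one or more word chars
    • (-\w[\w-]*) - Group 2: a hyphen, a word char, and then zero or more word or hyphen chars.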
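To make the group breakdown concrete, here is a small sketch that prints what each group captures for the first sample sentence. It only inspects the match; the names rx and m are illustrative.

```python
import re

# Pattern from the answer, applied to the first sentence from the question.
rx = re.compile(r'\b-(\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*))')
text = 'But seeing them playing to seven- and eight-year-olds is beautiful'

m = rx.search(text)
print(repr(m.group(0)))  # '- and eight-year-olds'
print(repr(m.group(1)))  # ' and eight-year-olds'
print(repr(m.group(2)))  # '-year-olds'
```

Since Group 2 holds the shared suffix and Group 1 holds everything after the dangling hyphen, emitting Group 2 followed by Group 1 in the replacement puts the suffix back onto the truncated word.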

See the Python demo:

import re

texts = ['But seeing them playing to seven- and eight-year-olds is beautiful', 'The country was in the uproar of pre- and then post-independence civil war but the mood here is most often joyous']
# Group 1: everything after the dangling hyphen; Group 2: the shared suffix
rx = r'\b-(\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*))'
for text in texts:
    print(re.sub(rx, r'\2\1', text))

Output:

But seeing them playing to seven-year-olds and eight-year-olds is beautiful
The country was in the uproar of pre-independence and then post-independence civil war but the mood here is most often joyous
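For use on a whole corpus, the substitution could be wrapped in a helper with a precompiled pattern. This is only a sketch: the function name expand_suspended_hyphens and the extra test sentences are hypothetical, and the replacement r'\2\1' is assumed from the expected output above.

```python
import re

# Precompiled pattern from the answer (Group 1: tail, Group 2: shared suffix).
SUSPENDED_HYPHEN = re.compile(
    r'\b-(\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*))'
)

def expand_suspended_hyphens(text: str) -> str:
    """Re-attach the shared suffix to a word ending in a dangling hyphen.

    Hypothetical helper; assumes the replacement r'\\2\\1'.
    """
    return SUSPENDED_HYPHEN.sub(r'\2\1', text)

print(expand_suspended_hyphens('two- or three-day trips'))
# two-day or three-day trips
print(expand_suspended_hyphens('A well-known and respected author'))
# A well-known and respected author (no dangling hyphen, so unchanged)
```

One caveat worth checking on real data: because (?:\s+\w+)* places no limit on the number of intervening words, the pattern can reach across several words to a later, unrelated hyphenated word, so spot-checking the substitutions on a sample of the corpus is advisable.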