Regex in Python to detect ellipsis
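I have a large text corpus that I want to preprocess and then use to train a Word2Vec model. In some cases, part of a word has been dropped because of ellipsis, as in: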
But seeing them playing to seven- and eight-year-olds is beautiful
or
The country was in the uproar of pre- and then post-independence civil war but the mood here is most often joyous
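Now I would like to undo these deletions (restoring seven-year-olds and pre-independence, respectively). This is what I wrote:
re.sub(r'- (and|to|or)( [^ -]+?){1,2}-(.+?)( |$|\n)', r'-\3 \1\2-\3\4', text)
But this does not work: if there is more than one word between and/or/to and the second hyphenated word, only one of them ends up in the output.
The output I want is: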
But seeing them playing to seven-year-olds and eight-year-olds is beautiful
and
The country was in the uproar of pre-independence and then post-independence civil war but the mood here is most often joyous
I found a solution:
re.sub(r'- (and|to|or)((?: [^ -]+?){1,2})-(.+?)( |$|\n)', r'-\3 \1\2-\3\4', text)
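The two patterns differ only in how the repeated part is captured. Python's re module keeps just the last repetition of a quantified capturing group, whereas a single capturing group wrapped around a repeated non-capturing group spans all repetitions. A minimal sketch of that difference (the names broken and fixed are only illustrative):

```python
import re

text = "pre- and then post-independence civil war"

# Quantified capturing group: group 2 is overwritten on every repetition,
# so only the last repetition survives.
broken = re.search(r'- (and|to|or)( [^ -]+?){1,2}-(.+?)( |$|\n)', text)
print(repr(broken.group(2)))  # ' post'

# Repeated non-capturing group inside one capturing group: group 2 now
# covers every repetition.
fixed = re.search(r'- (and|to|or)((?: [^ -]+?){1,2})-(.+?)( |$|\n)', text)
print(repr(fixed.group(2)))  # ' then post'
```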
You can use
re.sub(r'\b-(\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*))', r'\2\1', text)
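See the regex demo. Details:
\b- - a hyphen preceded by a word character
(\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*)) - Group 1:
    \s+ - one or more whitespace characters
    (?:and|to|or) - and, to, or or
    (?:\s+\w+)* - zero or more occurrences of one or more whitespace characters followed by one or more word characters
    \s+ - one or more whitespace characters
    \w+ - one or more word characters
    (-\w[\w-]*) - Group 2: a hyphen, a word character, then zero or more word characters or hyphens.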
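To make the group breakdown concrete, the following sketch prints what each group captures on a shortened version of the first sample sentence. The variable names rx and m are only illustrative.

```python
import re

# Pattern copied from the answer; compiled here only for convenience.
rx = re.compile(r'\b-(\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*))')

m = rx.search('playing to seven- and eight-year-olds is beautiful')
print(repr(m.group(0)))  # whole match, starting at the dangling hyphen
print(repr(m.group(1)))  # Group 1: everything after the dangling hyphen
print(repr(m.group(2)))  # Group 2: the shared suffix
```

Here Group 2 is '-year-olds' and Group 1 is ' and eight-year-olds', so placing Group 2 in front of Group 1 yields '-year-olds and eight-year-olds'.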
See the Python demo:
import re
texts = ['But seeing them playing to seven- and eight-year-olds is beautiful', 'The country was in the uproar of pre- and then post-independence civil war but the mood here is most often joyous']
# Group 2 is the shared suffix, Group 1 is everything after the dangling hyphen
rx = r'\b-(\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*))'
for text in texts:
    print(re.sub(rx, r'\2\1', text))
Output:
But seeing them playing to seven-year-olds and eight-year-olds is beautiful
The country was in the uproar of pre-independence and then post-independence civil war but the mood here is most often joyous
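For use on a whole corpus, the substitution could be wrapped in a small helper. This is only a sketch: the function name expand_suspended_hyphens and the extra test sentences are hypothetical, and the replacement r'\2\1' assumes the group layout described above (Group 2 is the shared suffix, Group 1 is everything after the dangling hyphen).

```python
import re

# Pattern from the answer; compiled once so it can be reused per sentence.
_RX = re.compile(r'\b-(\s+(?:and|to|or)(?:\s+\w+)*\s+\w+(-\w[\w-]*))')


def expand_suspended_hyphens(text: str) -> str:
    """Copy the suffix of the following hyphenated word onto a dangling hyphen."""
    return _RX.sub(r'\2\1', text)


print(expand_suspended_hyphens('two- or three-day trips'))
print(expand_suspended_hyphens('a well-known fact'))  # no dangling hyphen, unchanged
```

Because (?:\s+\w+)* permits any number of intervening words, the pattern can also pair a dangling hyphen with a hyphenated word much later in the sentence, so it is worth spot-checking the results on the real corpus.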