Python regex to match whole words (minus contractions and possessives)
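I'm trying to use a regex in Python to capture whole words from a text. That part is simple enough, but I also want to drop the contractions and possessives that are marked by an apostrophe.
Currently I have (?iu)(?<!')(?!n')[\w]+
Testing it against the following text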
One tree or many trees? My tree's green. I didn't figure this out yet.
gives these matches
One tree or many trees My tree green I didn figure this out yet
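In this example, the negative lookbehind stops the "s" and "t" after the apostrophe from being matched as whole words. But how do I write the negative lookahead (?!n') so that the matches include "did" rather than "didn"?
(My use case is a simple Python spell checker where each word is checked for correct spelling. I ended up using the autocorrect module, because pyenchant, aspell-python and others didn't work when installed via pip.)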
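For reference, here is a minimal sketch that reproduces the behaviour described above. The variable names (`text`, `pattern`, `words`) are only illustrative. It shows that the lookahead `(?!n')` is tested once, at the position where a match starts, so it never gets to see the `n'` in the middle of "didn't":

```python
import re

text = "One tree or many trees? My tree's green. I didn't figure this out yet."

# The pattern from the question. (?!n') is checked only at the start of each
# candidate match, so [\w]+ still runs greedily through the "n" of "didn't".
pattern = r"(?iu)(?<!')(?!n')[\w]+"

words = re.findall(pattern, text)
print(words)
```

The list printed here contains 'didn' rather than 'did', which is the problem the question is asking about.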
I would use this regex:
(?<![\w'])\w+?(?=\b|n't)
This matches word characters until it reaches either a word boundary or the text n't, whichever comes first.
Result:
>>> re.findall(r"(?<![\w'])\w+?(?=\b|n't)", "One tree or many trees? My tree's green. I didn't figure this out yet.")
['One', 'tree', 'or', 'many', 'trees', 'My', 'tree', 'green', 'I', 'did', 'figure', 'this', 'out', 'yet']
Breakdown:
(?<! # negative lookbehind: assert the text is not preceded by...
[\w'] # ... a word character or apostrophe
)
\w+? # match word characters, as few as necessary, until...
(?=
\b # ... a word boundary...
| # ... or ...
n't # ... the text "n't"
)
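If you want to keep that commentary in the code itself, one option is to compile the pattern with `re.VERBOSE`. This is just a sketch: the names `WORD_RE` and `extract_words` are made up for illustration and are not part of any library.

```python
import re

# Same regex as above, written with re.VERBOSE so the comments live in the pattern.
WORD_RE = re.compile(r"""
    (?<![\w'])   # not preceded by a word character or an apostrophe
    \w+?         # as few word characters as necessary, until...
    (?=\b|n't)   # ...a word boundary or the text "n't"
""", re.VERBOSE)


def extract_words(text):
    """Return the whole words in text, without "n't" and apostrophe suffixes."""
    return WORD_RE.findall(text)


sample = "One tree or many trees? My tree's green. I didn't figure this out yet."
print(extract_words(sample))
```

One thing to keep in mind: the pattern cuts every word right before n't, so "can't" comes out as "ca". If the words are going to a spell checker, contractions like that may need separate handling.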