python regex: match the dot only, not the letter before it

Question

I have the following regex pattern:

r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)+'

I am trying to modify it so that it matches only the dot at the end of a sentence, not the letter before the dot. This is my string:

sent = 'This is the U.A. we have r.a.d. golden 13.56 date. a better date 34. was there.'

This is what I did:

import re
re.split(r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)+', sent)

However, the split removes the last letter of the word before each dot:

current output:
['This is the U.A. we have r.a.d. golden 13.56 dat',' a better date 34. was ther',
 '']

The output I want is:

['This is the U.A. we have r.a.d. golden 13.56 date',' a better date 34. was there',
 '']

I don't know how to modify the pattern so that the last letter of the words 'date' and 'there' is kept.

Answer 1

Your pattern can be shortened and fixed as

(?<=(?<![.\s])[a-zA-Z])\.

See the regex demo.
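As a quick check, the fixed pattern can be applied to the string from the question. This is a minimal sketch; the variable names `pattern` and `parts` are only illustrative.

```python
import re

sent = 'This is the U.A. we have r.a.d. golden 13.56 date. a better date 34. was there.'

# Match a dot only when it follows an ASCII letter that is itself
# not preceded by a dot or whitespace. The letter is only checked,
# never consumed, so re.split keeps it.
pattern = re.compile(r'(?<=(?<![.\s])[a-zA-Z])\.')

parts = pattern.split(sent)
print(parts)
# ['This is the U.A. we have r.a.d. golden 13.56 date', ' a better date 34. was there', '']
```

The dots in 'U.A.', 'r.a.d.', '13.56' and '34.' are left alone, because the character before each of them is either a digit or a letter that follows a dot or whitespace.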

If you also need to match several consecutive dots, put the + back after \. in the pattern.
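The effect of the trailing + can be seen on a string with an ellipsis. The sample string `text` below is made up for illustration and does not come from the question.

```python
import re

# Hypothetical sample with a run of dots after a word.
text = 'Wait... really. Yes'

single = re.compile(r'(?<=(?<![.\s])[a-zA-Z])\.')   # one dot per match
multi = re.compile(r'(?<=(?<![.\s])[a-zA-Z])\.+')   # a whole run of dots per match

single_parts = single.split(text)
multi_parts = multi.split(text)

print(single_parts)  # ['Wait', '.. really', ' Yes']
print(multi_parts)   # ['Wait', ' really', ' Yes']
```

Without the +, only the first dot of the run is removed, because the second and third dots are preceded by a dot rather than a letter.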

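Details:

(?<=(?<![.\s])[a-zA-Z]) - a positive lookbehind that matches a position immediately preceded by
- (?<![.\s]) - a negative lookbehind that fails the match if a dot or a whitespace character is immediately to the left of the current position (here, the position just before the letter)
- [a-zA-Z] - an ASCII letter
\. - a literal dot.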

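Your pattern is basically an alternation of two patterns, (?<!\.|\s)[a-z]\. and (?<!\.|\s)[A-Z]\., and the only difference between them is [a-z] versus [A-Z]. The same alternation can therefore be shortened to (?<!\.|\s)[a-zA-Z]\. To keep the letter from being eaten when splitting, [a-zA-Z] must be moved into a non-consuming pattern, so a positive lookbehind is the natural solution.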
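The difference between the consuming and the non-consuming form shows up in what each pattern actually matches. The sketch below uses re.findall on the string from the question; the names `consuming` and `lookbehind` are illustrative.

```python
import re

sent = 'This is the U.A. we have r.a.d. golden 13.56 date. a better date 34. was there.'

# Shortened alternation: the letter is part of the match, so split removes it.
consuming = re.compile(r'(?<!\.|\s)[a-zA-Z]\.')
# Letter moved into a positive lookbehind: only the dot is matched.
lookbehind = re.compile(r'(?<=(?<![.\s])[a-zA-Z])\.')

consumed = consuming.findall(sent)
kept = lookbehind.findall(sent)

print(consumed)  # ['e.', 'e.']
print(kept)      # ['.', '.']
```

Both patterns find the same two places in the string, but only the lookbehind version leaves the final 'e' of 'date' and 'there' outside the match.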
