How to replace a dot (.) in a sentence, except when it appears in an abbreviation, using a regular expression

Question

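I want to replace every dot in a sentence with a space, except when the dot is part of an abbreviation. When it is part of an abbreviation, I want to replace it with an empty string ('').

An abbreviation here means a dot surrounded by at least two capital letters.

My regex works, except that it catches U.S.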

r1 = r'\b((?:[A-Z]\.){2,})\s*'
r2 = r'(?:[A-Z]\.){2,}'

'U.S.A is abbr  x.y  is not. But I.I.T. is also valid ABBVR and so is M.Tech'

should become

'USA is abbr  x y  is not But IIT is also valid ABBVR and so is MTech'
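
To make the problem concrete, here is a quick check of what r2 matches on the sample string. This is only a diagnostic sketch; the variable name `matches` is illustrative and not part of the original attempt.

```python
import re

# Sample string from the question.
s = 'U.S.A is abbr  x.y  is not. But I.I.T. is also valid ABBVR and so is M.Tech'

# The second pattern: two or more "uppercase letter + dot" pairs in a row.
r2 = r'(?:[A-Z]\.){2,}'

# r2 has no capturing groups, so findall returns the full matches.
matches = re.findall(r2, s)
print(matches)  # ['U.S.', 'I.I.T.']
```

The pattern stops at `U.S.` because the final `A` has no dot after it, and it never matches `M.Tech` because only one letter-dot pair is present.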

UPDATE: Digits and special characters should not be treated as part of an abbreviation.

X.2 -> X 2
X. -> X 
X.* -> X *

Answer 1

You can use

import re
s='U.S.A is abbr  x.y  is not. But I.I.T. is also valid ABBVR and so is M.Tech, X.2, X., X.*'
print(re.sub(r'(?<=[A-Z])(\.)(?=[A-Z])|\.', lambda x: '' if x.group(1) else ' ', s))
# =>  USA is abbr  x y  is not  But IIT  is also valid ABBVR and so is MTech, X 2, X , X *

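See the Python demo and the regex demo. The pattern matches:

(?<=[A-Z])(\.)(?=[A-Z]) - Group 1: a . character that is immediately preceded and immediately followed by an uppercase ASCII letter
| - or
\. - a dot (in any other context)

If Group 1 matched, the dot is replaced with an empty string; otherwise, it is replaced with a space.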
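
The same logic can also be written as two passes without a callback, which some readers may find easier to follow. This is a minimal sketch under the same assumptions as the pattern above (ASCII uppercase only); the function name `replace_dots` is made up for illustration.

```python
import re

def replace_dots(text: str) -> str:
    # Pass 1: drop dots that sit directly between two uppercase ASCII letters.
    # Lookarounds do not consume the letters, so consecutive dots as in
    # "U.S.A" are all matched.
    text = re.sub(r'(?<=[A-Z])\.(?=[A-Z])', '', text)
    # Pass 2: every dot that is still left becomes a space.
    return text.replace('.', ' ')

for sample in ['U.S.A', 'x.y', 'M.Tech', 'X.2', 'X.*']:
    print(repr(sample), '->', repr(replace_dots(sample)))
```

Because every remaining dot becomes a space in the second pass, the result is the same as the single-pass alternation. Note that a trailing dot, as in `I.I.T.`, is not followed by an uppercase letter and therefore turns into a space.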

To make it Unicode-aware, install the PyPI regex library (pip install regex) and use

import regex
s='U.S.A is abbr  x.y  is not. But I.I.T. is also valid ABBVR and so is M.Tech, X.2, X., X.*'
print(regex.sub(r'(?<=\p{Lu})(\.)(?=\p{Lu})|\.', lambda x: '' if x.group(1) else ' ', s))

\p{Lu} matches any Unicode uppercase letter.
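
If installing a third-party package is not an option, a similar check can be sketched with the standard library only, by looking up the Unicode category of the characters around each dot. This is an assumption-laden alternative, not part of the approach above; the helper names `is_upper_letter` and `replace_dots_unicode` are hypothetical.

```python
import re
import unicodedata

def is_upper_letter(ch: str) -> bool:
    # True only for a character in Unicode general category Lu
    # (uppercase letter); the empty string stands for "no neighbor".
    return ch != '' and unicodedata.category(ch) == 'Lu'

def replace_dots_unicode(text: str) -> str:
    def repl(m):
        i = m.start()
        before = text[i - 1] if i > 0 else ''
        after = text[i + 1] if i + 1 < len(text) else ''
        # Remove the dot only when both neighbors are uppercase letters.
        if is_upper_letter(before) and is_upper_letter(after):
            return ''
        return ' '
    return re.sub(r'\.', repl, text)

print(replace_dots_unicode('U.S.A and С.Ш.А, but x.y'))
```

The callback inspects the original string by index, which is safe here because `re.sub` always reports match positions relative to the input, not the partially built output.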
