How to replace a dot (.) in a sentence, except when it appears in an abbreviation, using a regular expression
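I want to replace every dot in a sentence with a space, except when it is used in an abbreviation. When it is used in an abbreviation, I want to replace it with '' (an empty string).
An abbreviation means a dot surrounded by capital letters, with at least two capital letters in total.
My regexes are working, except that they catch U.S.: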
r1 = r'\b((?:[A-Z]\.){2,})\s*'
r2 = r'(?:[A-Z]\.){2,}'
'U.S.A is abbr x.y is not. But I.I.T. is also valid ABBVR and so is M.Tech'
should become
'USA is abbr x y is not But IIT is also valid ABBVR and so is MTech'
UPDATE: Digits and special characters should not be considered (a dot next to one is replaced with a space):
X.2 -> X 2
X. -> X
X.* -> X -
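To make the problem with the patterns above concrete, here is a small check of what r2 actually matches on the sample sentence. It is only an illustration of the reported behaviour: because every letter must be followed by a dot, the match on U.S.A stops at U.S. and leaves the final A out, and M.Tech is not matched at all since it contains only one letter-dot pair.

```python
import re

# r2 from the question: two or more "capital letter + dot" pairs
r2 = r'(?:[A-Z]\.){2,}'
s = 'U.S.A is abbr x.y is not. But I.I.T. is also valid ABBVR and so is M.Tech'

# The group is non-capturing, so findall returns the whole matches
matches = re.findall(r2, s)
print(matches)  # ['U.S.', 'I.I.T.']
```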
You can use
import re
s='U.S.A is abbr x.y is not. But I.I.T. is also valid ABBVR and so is M.Tech, X.2, X., X.*'
print(re.sub(r'(?<=[A-Z])(\.)(?=[A-Z])|\.', lambda x: '' if x.group(1) else ' ', s))
# => USA is abbr x y is not  But IIT  is also valid ABBVR and so is MTech, X 2, X , X *
# (a dot that was already followed by a space leaves two consecutive spaces)
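See the Python demo. Here is a regex demo. It matches
(?<=[A-Z])(\.)(?=[A-Z]) - Group 1: a . character that is immediately preceded and followed by an uppercase ASCII letter
| - or
\. - a dot (in any other context)
If Group 1 matches, the replacement is an empty string; otherwise, the replacement is a space.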
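The group-based replacement described above can also be wrapped in a small helper, which makes the role of Group 1 easier to see. This is a minimal sketch: the names DOT_RE and replace_dots are made up for illustration and do not come from the original answer.

```python
import re

# Group 1 captures a dot that sits between two uppercase ASCII letters;
# the second alternative matches any other dot.
DOT_RE = re.compile(r'(?<=[A-Z])(\.)(?=[A-Z])|\.')

def replace_dots(text):
    """Drop dots inside abbreviations; turn every other dot into a space."""
    def repl(match):
        # group(1) is None when the second alternative matched
        return '' if match.group(1) else ' '
    return DOT_RE.sub(repl, text)

print(replace_dots('U.S.A'))   # USA
print(replace_dots('x.y'))     # x y
print(replace_dots('M.Tech'))  # MTech
print(replace_dots('X.2'))     # X 2
```

A trailing dot, as in I.I.T., still becomes a space. If doubled or trailing spaces are unwanted, collapsing whitespace afterwards (for example with ' '.join(result.split())) is one option.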
To make it Unicode-aware, install the PyPI regex library (pip install regex) and use
import regex
s='U.S.A is abbr x.y is not. But I.I.T. is also valid ABBVR and so is M.Tech, X.2, X., X.*'
print(regex.sub(r'(?<=\p{Lu})(\.)(?=\p{Lu})|\.', lambda x: '' if x.group(1) else ' ', s))
\p{Lu} matches any Unicode uppercase letter.
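If installing a third-party package is not an option, roughly the same behaviour can be approximated with the standard library alone. The sketch below is an alternative technique, not the approach from the answer: it matches every dot with re and decides in the callback, using unicodedata.category to test whether the neighbouring characters are in the Lu category. The helper names is_upper and replace_dots_unicode are hypothetical.

```python
import re
import unicodedata

def is_upper(ch):
    # 'Lu' is the Unicode general category for uppercase letters
    return unicodedata.category(ch) == 'Lu'

def replace_dots_unicode(text):
    """Drop a dot between two Unicode uppercase letters; other dots become spaces."""
    def repl(match):
        i = match.start()
        # Neighbours are read from the original string, like a lookaround would
        before = text[i - 1] if i > 0 else ''
        after = text[i + 1] if i + 1 < len(text) else ''
        if before and after and is_upper(before) and is_upper(after):
            return ''
        return ' '
    return re.sub(r'\.', repl, text)

print(replace_dots_unicode('С.Ш.А'))  # США
print(replace_dots_unicode('x.y'))    # x y
```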