How to match abbreviations with their meaning using regex?
I am looking for a regex pattern that matches the following string:
Some example text (SET) that demonstrates what I'm looking for. Energy system models (ESM) are used to find specific optima (SCO). Some say computer systems (CUST) are cool. In the summer playing outside (OUTS) should be preferred.
My goal is to match the following:
Some example text (SET)
Energy system models (ESM)
specific optima (SCO)
computer systems (CUST)
outside (OUTS)
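It is important that it is not always exactly three words and their initials. Sometimes the letters used for the abbreviation are only contained in the preceding words. That is why I started looking into positive lookbehind. However, it is limited in length, which can be worked around by combining it with a positive lookahead. So far, I have not come up with a robust solution.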
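The length limit mentioned here can be seen directly in Python's built-in re module (an assumption, since the question does not name a regex engine): a lookbehind must have a fixed width, so a variable-width one is rejected when the pattern is compiled. A minimal sketch, with a hypothetical helper named compiles:

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Fixed-width lookbehind (exactly one character): accepted.
fixed_ok = compiles(r"(?<=\()[A-Z]")
# Variable-width lookbehind (\w+ can have any length): rejected by re.
variable_ok = compiles(r"(?<=\(\w+)[A-Z]")

print(fixed_ok, variable_ok)
```

Other engines handle this differently, so the result applies only to Python's re.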
What I have tried so far:
(\b[\w -]+?)\((([A-Z])(?<=(?=.*?))(?:[A-Z]){1,4})\)
This works fairly well, but the matches contain too many words:
Some example text (SET)
Energy system models (ESM)
are used to find specific optima (SCO)
Some say Computer systems (CUST)
In the summer playing outside (OUTS)
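I also tried using a reference to the first letter of the abbreviation at the beginning of the first group, but that did not work at all.
Things I looked at but did not find useful: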
- regex for catching abbreviations
Useful resources:
- something on lookbehinds
- something on lookarounds in general
I suggest using

import re

def contains_abbrev(abbrev, text):
    text = text.lower()
    # The abbreviation must be all upper case, so "(s)" and "(Bexle)" are rejected.
    if not abbrev.isupper():
        return False
    cnt = 0
    # Look for each abbreviation letter, in order, in the remaining text.
    for c in abbrev.lower():
        pos = text.find(c)
        if pos > -1:
            text = text[pos:]
            cnt += 1
    return cnt == len(abbrev)

text = "Some example text (SET) that demonstrates what I'm looking for. Energy system models (ESM) are used to find specific optima (SCO). Some say computer systems (CUST) are cool. In the summer playing outside (OUTS) should be preferred. Stupid example(s) Stupid example(S) Not stupid example (NSEMPLE), bad example (Bexle)"
# \2 makes the abbreviation start with the same letter as the first word of Group 1.
abbrev_rx = r'\b(([A-Z])\w*(?:\s+\w+)*?)\s*\((\2[A-Z]*)\)'
print([x.group() for x in re.finditer(abbrev_rx, text, re.I) if contains_abbrev(x.group(3), x.group(1))])
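See the Python demo.
The regex used is
(?i)\b(([A-Z])\w*(?:\s+\w+)*?)\s*\((\2[A-Z]*)\)
See the regex demo. Details:
- \b - a word boundary
- (([A-Z])\w*(?:\s+\w+)*?) - Group 1 (text): an ASCII letter captured into Group 2, then 0+ word characters, followed by zero or more occurrences of 1+ whitespace characters and 1+ word characters, as few as possible
- \s* - 0+ whitespace characters
- \( - a ( character
- (\2[A-Z]*) - Group 3 (abbrev): the same value as in Group 2, then 0 or more ASCII letters
- \) - a ) character.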
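To make the group layout concrete, here is a small sketch that prints what each group captures on a shortened sample. It assumes the pattern includes the \2 backreference that the description of Group 3 implies; the names ABBREV_RX and sample are only illustrative.

```python
import re

# Pattern from the list above, including the \2 backreference that ties
# the first letter of the abbreviation to the first letter of Group 1.
ABBREV_RX = re.compile(r"\b(([A-Z])\w*(?:\s+\w+)*?)\s*\((\2[A-Z]*)\)", re.I)

sample = "Energy system models (ESM) are used to find specific optima (SCO)."

# Each tuple is (Group 1: text, Group 2: first letter, Group 3: abbrev).
groups = [m.groups() for m in ABBREV_RX.finditer(sample)]
for text, first_letter, abbrev in groups:
    print(f"{abbrev!r} <- {text!r} (starts with {first_letter!r})")
```

Because of re.I, the backreference compares case-insensitively, which is why the lower-case "s" of "specific" can pair with the "S" of "SCO".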
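Once there is a match, Group 3 is passed as abbrev and Group 1 as text to the contains_abbrev(abbrev, text) function, which makes sure that abbrev is an uppercase string and that all characters of abbrev occur in text, in the same order.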
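One detail of that check is worth noting: the slice restarts at the matched character rather than after it, so a repeated letter in abbrev can be satisfied by a single letter in text. If that is not wanted, a stricter variant can consume each matched character. This is only a sketch; the name is_ordered_subsequence is made up for illustration and is not part of the answer above.

```python
def is_ordered_subsequence(abbrev, text):
    """True if abbrev is all upper case and its letters occur in text,
    in order, each one using up a distinct character of text."""
    if not abbrev.isupper():
        return False
    remaining = iter(text.lower())
    # `c in remaining` advances the iterator past the first match,
    # so the next letter is searched only in what follows.
    return all(c in remaining for c in abbrev.lower())

print(is_ordered_subsequence("CUST", "computer systems"))  # True
print(is_ordered_subsequence("SS", "some"))                # False
print(is_ordered_subsequence("Bexle", "bad example"))      # False
```

It could be dropped in as the filter in place of contains_abbrev, since it takes the same two arguments.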
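Regex alone is not enough; it looks like you may need a Python script.
This should handle all your cases:

import re

a = "Some example text (SET) that demonstrates what I'm looking for. Energy system models (ESM) are used to find specific optima (SCO). Some say computer systems (CUST) are cool. In the summer playing outside (OUTS) should be preferred."
# Collect every "(...)" token together with its inner text.
b = re.findall(r"(\((.*?)\))", a)
a = a.replace(".", "")
i = a.split(' ')
for c in b:
    cont = 0
    m = []
    s = i.index(c[0])  # position of the "(ABBR)" token
    l = len(c[1])      # look back as many words as the abbreviation has letters
    al = s - l
    for j in range(al, s + 1):
        # Start collecting at the first word whose initial matches the abbreviation's first letter.
        if i[j][0].lower() == c[1][0].lower():
            cont = 1
        if cont == 1:
            m.append(i[j])
    print(' '.join(m))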
Output:
Some example text (SET)
Energy system models (ESM)
specific optima (SCO)
computer systems (CUST)
outside (OUTS)
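The same window heuristic can be wrapped in a function so it can be reused and tested. This is a sketch under a few assumptions that are not in the original script: the name find_expansions is hypothetical, the window start is clamped at 0 so a short leading phrase cannot reach the end of the list through a negative index, str.split() without arguments is used so repeated spaces do not produce empty tokens, and only single-token abbreviations such as "(ESM)" are recognized.

```python
import re

def find_expansions(text):
    """Return 'words (ABBR)' phrases: look back len(ABBR) words and start
    at the first word whose initial matches the abbreviation's first letter."""
    words = text.replace(".", "").split()
    results = []
    for idx, word in enumerate(words):
        m = re.fullmatch(r"\((\w+)\)", word)
        if not m:
            continue
        abbr = m.group(1)
        start = max(0, idx - len(abbr))  # clamp to avoid negative indices
        for j in range(start, idx):
            if words[j][0].lower() == abbr[0].lower():
                results.append(" ".join(words[j:idx + 1]))
                break
    return results

sample = ("Energy system models (ESM) are used to find specific optima (SCO). "
          "In the summer playing outside (OUTS) should be preferred.")
print(find_expansions(sample))
```

Unlike the script, this version walks the tokens with enumerate instead of list.index, so a repeated abbreviation is handled at each position, and an abbreviation with no matching word is skipped rather than printed as an empty line.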