How to find acronyms containing numbers in a string

Question

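I need to write a function that finds uppercase acronyms, including ones that contain digits, but so far I can only detect acronyms made up of letters alone.

An example:

s = "the EU needs to contribute part of their GDP to improve the IC3 plan"

I tried: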

import re

def acronym(s):
    return re.findall(r"\b[A-Z]{2,}\b", s)

print(acronym(s))

But I only get:

['EU', 'GDP']

What can I add or change to get:

['EU', 'GDP', 'IC3']

Thanks

Answer 1

Try:

import re

def acronym(s):
    # The outer (?:...) keeps both \b anchors attached to either alternative.
    return re.findall(r"\b(?:(?:[0-9]+[A-Z][A-Z0-9]*)|(?:[A-Z][A-Z0-9]+))\b", s)

print(acronym('3I 33 I3 A GDP W3C'))

Output:

['3I', 'I3', 'GDP', 'W3C']

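This regex means:

Find any word, delimited by \b word boundaries, that either:

starts with one or more digits, followed by at least one uppercase letter, optionally followed by more uppercase letters and digits; or
starts with an uppercase letter, followed by at least one more uppercase letter or digit.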
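One subtlety with a pattern of this shape is that `|` has the lowest precedence in a regex, so unless the alternatives sit inside a single group, each `\b` belongs to only one branch. The following is a minimal sketch of that effect; the sample string and the variable names `ungrouped` and `grouped` are made up for illustration.

```python
import re

text = "xAB 3Ab W3C"

# Without an enclosing group, "|" splits the whole pattern in two:
# the left branch has no trailing \b, the right branch has no leading \b.
ungrouped = re.findall(r"\b[0-9]+[A-Z][A-Z0-9]*|[A-Z][A-Z0-9]+\b", text)

# With an enclosing non-capturing group, both boundaries apply to either branch.
grouped = re.findall(r"\b(?:[0-9]+[A-Z][A-Z0-9]*|[A-Z][A-Z0-9]+)\b", text)

print(ungrouped)  # ['AB', '3A', 'W3C']
print(grouped)    # ['W3C']
```

On typical input where acronyms are separated by spaces both forms agree, so the difference only shows up when an acronym-like run is glued to lowercase letters.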

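The ?: prefix makes each group non-capturing, so re.findall returns the whole match as a single string instead of a tuple with one entry per group.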
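To see what the non-capturing groups buy you, here is a small sketch comparing the two forms with `re.findall`. The variable names are only for illustration.

```python
import re

text = "3I 33 I3 A GDP W3C"

# Capturing groups: findall returns one tuple per match, with one slot per
# group; the slot of the branch that did not take part is an empty string.
capturing = re.findall(
    r"\b(?:([0-9]+[A-Z][A-Z0-9]*)|([A-Z][A-Z0-9]+))\b", text
)

# Non-capturing groups: findall returns the full matched text.
non_capturing = re.findall(
    r"\b(?:(?:[0-9]+[A-Z][A-Z0-9]*)|(?:[A-Z][A-Z0-9]+))\b", text
)

print(capturing)      # [('3I', ''), ('', 'I3'), ('', 'GDP'), ('', 'W3C')]
print(non_capturing)  # ['3I', 'I3', 'GDP', 'W3C']
```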

Answer 2

This regex will not match bare numbers (e.g. 123):

import re

s = "the EU needs to contribute part of their GDP to improve the IC3 plan"

def acronym(s):
    return re.findall(r"\b([A-Z]{2,}\d*)\b", s)

print(acronym(s))

Prints:

['EU', 'GDP', 'IC3']
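Note that this pattern requires at least two leading letters and only allows digits at the end. As a rough check of what that means in practice, here is the same function run on a made-up string with digits in other positions (the string and the name `result` are just for illustration):

```python
import re

def acronym(s):
    return re.findall(r"\b([A-Z]{2,}\d*)\b", s)

# 'GDP' and 'IC3' fit the "letters first, digits last" shape;
# '3I', 'I3' and 'W3C' do not, so they are skipped.
result = acronym("3I 33 I3 A GDP W3C IC3")
print(result)  # ['GDP', 'IC3']
```

If your acronyms always look like IC3, that restriction may be exactly what you want.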


Answer 3

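Try this.

It is similar to the answers by Andrej and S. Pellegrino, but it does not capture purely numeric strings such as '123', and it captures strings with digits in any position, not only at the end.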

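Pattern explanation:

\b - matches a word boundary (the start of the word).

(?=[A-Z\d]*[A-Z]) - asserts that the word starting here contains at least one uppercase letter. This is called a positive lookahead. It uses [A-Z\d]* rather than .* so that it cannot look past the current word.

[A-Z\d]{2,} - matches an uppercase letter or digit, two or more times.

\b - matches another word boundary (the end of the word).

import re

def acronym(s):
    # The lookahead rejects all-digit words; the character class does the matching.
    pattern = r'\b(?=[A-Z\d]*[A-Z])[A-Z\d]{2,}\b'
    return re.findall(pattern, s)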
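One thing to watch with a lookahead is how far it is allowed to scan. The sketch below compares a lookahead written with `.*` against one limited to `[A-Z\d]*`. The sample string and the names `loose` and `scoped` are made up for illustration.

```python
import re

text = "33 I3 A GDP W3C"

# ".*" lets the lookahead run to the end of the line, so "33" passes
# the check because capital letters occur later in the text.
loose = re.findall(r"\b(?=.*[A-Z])[A-Z\d]{2,}\b", text)

# "[A-Z\d]*" keeps the lookahead inside the current word, so a word
# made only of digits is rejected.
scoped = re.findall(r"\b(?=[A-Z\d]*[A-Z])[A-Z\d]{2,}\b", text)

print(loose)   # ['33', 'I3', 'GDP', 'W3C']
print(scoped)  # ['I3', 'GDP', 'W3C']
```

Both forms skip a string that consists only of '123', since there is no capital letter anywhere for the lookahead to find. They differ only when a number appears before some other capitalised word.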

Edit: added an explanation of the regex pattern.

Tags: string, find, acronym, python-3.x