Detect specific pattern regex python

Question

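I want to find all occurrences of a specific term (and its variants) in a Word document.

Extracted the text from the Word document
Tried to find the pattern via regex

The pattern consists of words that start with DOC- followed by 9 digits after the -.

I tried the following without success:

The document variable is the extracted text, obtained with the following function: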

import re
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

document = getText(filename)  # filename: path to the .docx file

# The attempt that returns no matches
pattern = re.compile(r'^DOC.\d{9}$')
pattern.findall(document)

Can anyone help me?

Thanks in advance.

Answer 1

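You can use a combination of a word boundary on the left and a numeric boundary on the right.

Also, you say there must be a dash after DOC, but you used . in the pattern, which matches any character. I believe you also want to match any en- or em-dash, so I suggest using a more precise pattern, such as [-–—]. Note that there are other ways to match any Unicode dash character.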
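One such way, sketched here as an illustration rather than as the approach the answer had in mind, is to build the character class from every code point in the Unicode category Pd (Punctuation, dash) using the standard unicodedata module. The names DASHES, DOC_ID and sample are made up for this example.

```python
import re
import sys
import unicodedata

# Collect every code point whose Unicode general category is Pd (Punctuation, dash).
DASHES = ''.join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == 'Pd'
)

# re.escape keeps '-' literal inside the character class instead of forming a range.
DOC_ID = re.compile(r'\bDOC[' + re.escape(DASHES) + r']\d{9}(?!\d)')

# Hypothetical sample: hyphen-minus, en dash, em dash and U+2010 HYPHEN.
sample = 'DOC-123456789, DOC\u2013234567890, DOC\u2014345678901, DOC\u2010456789012'
print(DOC_ID.findall(sample))
```

This is broader than [-–—]: it also accepts less common dashes, which may or may not be what you want. The minus sign U+2212 is in category Sm, not Pd, so it is not covered.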

import re
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

# filename: path to the .docx file
print(re.findall(r'\bDOC[-–—]\d{9}(?!\d)', getText(filename)))

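Details:

\b - a word boundary
DOC - the DOC substring
[-–—] - a dash (hyphen, en- or em-dash)
\d{9} - nine digits
(?!\d) - immediately to the right of the current position, there must be no digit.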
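The details above can be checked on a small sample without a Word file. The sketch below uses a made-up string in place of the getText() output (the document text and the variable names are assumptions for illustration) and runs both the original pattern and the suggested one. Without re.MULTILINE, ^ and $ anchor only at the start and end of the whole string, so the original pattern can match only if the entire text is a single ID.

```python
import re

# Hypothetical stand-in for getText(filename): paragraphs joined by '\n'.
document = (
    'See DOC-123456789 for details.\n'
    'Older copies use DOC\u2013234567890 or DOC\u2014345678901.\n'
    'Not IDs: DOC-1234567890 (ten digits), XDOC-111111111 (no word boundary).'
)

# Pattern from the question: anchored to the whole string, '.' matches any character.
original = re.compile(r'^DOC.\d{9}$')

# Pattern from the answer: word boundary, explicit dash class, no digit afterwards.
fixed = re.compile(r'\bDOC[-–—]\d{9}(?!\d)')

original_hits = original.findall(document)
fixed_hits = fixed.findall(document)

print(original_hits)  # empty: the text as a whole is not a single ID
print(fixed_hits)     # the three IDs; the ten-digit and XDOC entries are skipped
```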


python

regex

pattern-matching