Extract words surrounding a RegEx match using re.findall when there exists an overlapping index

The goal is to extract up to 100 characters before and after each occurrence of the keyword "bankruptcy".

import re

# Named "text" rather than "str" to avoid shadowing the built-in type.
text = "The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s."

pattern = r"(?i)\s*(?:\w|\W){0,100}\b(?:bankruptcy)\b\s*(?:\w|\W){0,100}"

output = re.findall(pattern, text)

Expected output:

['The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.', 
 'The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.']

Current output: ['The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.']

Is there a way to handle the overlapping matches with re.findall?

You can use the following solution, based on the PyPI regex module (install it with pip install regex):

import regex
text = "The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s."
pattern = r"\b(?<=(.{0,100}))(bankruptcy)\b(?=(.{0,100}))"
print( [f"{x}{y}{z}" for x,y,z in regex.findall(pattern, text, flags=regex.I|regex.DOTALL)] )
# => ['The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.', 'The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.']

See the Python demo online. Regex details:

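  • \b - a word boundary
  • (?<=(.{0,100})) - a positive lookbehind that matches a location immediately preceded by any 0 to 100 characters, which are captured into Group 1 (note that regex.DOTALL lets . match any character, including line breaks)
  • (bankruptcy) - Group 2: bankruptcy (matched case-insensitively because of the regex.I flag)
  • \b - a word boundary
  • (?=(.{0,100})) - a positive lookahead that matches a location immediately followed by any 0 to 100 characters, which are captured into Group 3.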

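Since lookbehinds and lookaheads do not consume the text they match, each match can still capture the characters to the left and right of the search term, even when those characters overlap with another match.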
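
As a small illustration of that zero-width behaviour, here is a sketch that uses only the standard re module. The sample string "abcd" is made up for this example. A capturing group inside a lookahead yields overlapping results because the match itself is empty:

```python
import re

# A lookahead consumes no characters, so the scan advances one position
# at a time and the captured pairs overlap.
overlapping = re.findall(r"(?=(\w\w))", "abcd")

# Without the lookahead, each match consumes its characters, so the
# next match can only start after the previous one ends.
consuming = re.findall(r"\w\w", "abcd")

print(overlapping)  # ['ab', 'bc', 'cd']
print(consuming)    # ['ab', 'cd']
```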

Note that re cannot be used with this pattern because it does not allow variable-width patterns inside a lookbehind.
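
If installing regex is not an option, one possible workaround with the standard library is to match only the keyword with re.finditer and then slice the surrounding context out of the string by index. This is a sketch rather than part of the solution above; the helper name context_windows and its width parameter are hypothetical.

```python
import re

def context_windows(text, keyword, width=100):
    """Return each keyword occurrence with up to `width` characters on either side."""
    # re.escape guards against regex metacharacters in the keyword.
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    windows = []
    for m in pattern.finditer(text):
        # Clamp at 0 so a negative index never wraps around to the end of the string.
        start = max(0, m.start() - width)
        # Slicing past the end of a string is safe in Python, so no upper clamp is needed.
        windows.append(text[start:m.end() + width])
    return windows

text = "The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s."
result = context_windows(text, "bankruptcy")
print(len(result))  # 2
```

Because only the keyword itself is consumed, the context windows are free to overlap, and for this 100-character sample both windows cover the whole string.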