Extract words surrounding a RegEx match using re.findall when there exists an overlapping index

Question

The goal is to extract up to 100 characters before and after each occurrence of the keyword "bankruptcy".

str = "The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s."

pattern = r"(?i)\s*(?:\w|\W){0,100}\b(?:bankruptcy)\b\s*(?:\w|\W){0,100}"

import re

output = re.findall(pattern, str)

Expected output:

['The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.', 
 'The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.']

Current output: ['The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.']

Is there a way to handle the overlapping matches using re.findall?

Answer 1

You can use the following solution based on the PyPI regex module (install it with pip install regex):

import regex
text = "The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s."
pattern = r"\b(?<=(.{0,100}))(bankruptcy)\b(?=(.{0,100}))"
print( [f"{x}{y}{z}" for x,y,z in regex.findall(pattern, text, flags=regex.I|regex.DOTALL)] )
# => ['The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.', 'The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.']

See the Python demo online. Regex details:

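\b - a word boundary
(?<=(.{0,100})) - a positive lookbehind that matches a location immediately preceded by any 0 to 100 characters, which are captured into Group 1 (note that regex.DOTALL lets . match any character, including line breaks)
(bankruptcy) - Group 2: bankruptcy (matched case-insensitively because of the regex.I flag)
\b - a word boundary
(?=(.{0,100})) - a positive lookahead that matches a location immediately followed by any 0 to 100 characters, which are captured into Group 3.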

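Since lookbehinds and lookaheads do not consume the text they match, the same characters can be captured again for the next occurrence, so you can access all the characters to the left and right of each search term.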
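To illustrate the "do not consume" point in isolation, here is a minimal sketch using only the standard re module. The sample string "abcd" and the variable names are made up for this illustration and are not part of the original answer.

```python
import re

sample = "abcd"  # hypothetical toy input, chosen only for illustration

# A consuming pattern moves past each match, so matches can never overlap.
consuming = re.findall(r"\w\w", sample)

# A capturing group inside a lookahead is tried at every position,
# because the lookahead itself consumes no characters.
overlapping = re.findall(r"(?=(\w\w))", sample)

print(consuming)    # ['ab', 'cd']
print(overlapping)  # ['ab', 'bc', 'cd']
```

The same effect is what lets the lookbehind and lookahead groups above capture overlapping context around each keyword.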

Note that re cannot be used with this pattern because it does not allow non-fixed-width patterns in lookbehinds.
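If installing the regex module is not an option, one possible workaround (a different technique from the lookaround pattern above, and not part of the original answer) is to locate each keyword with re.finditer and slice the surrounding text by index. The sketch below first shows re rejecting the variable-width lookbehind, then applies the slicing approach; context_windows and its width parameter are hypothetical names introduced here.

```python
import re

text = ("The company announced bankruptcy on jan 1, 1900. "
        "Many more companies announced bankruptcy in 1920s.")

# The standard re module refuses a variable-width lookbehind at compile time.
try:
    re.compile(r"(?<=(.{0,100}))bankruptcy")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

def context_windows(text, keyword, width=100):
    """Return each whole-word keyword hit with up to `width` characters on each side.

    Hypothetical helper: it slices by match offsets instead of using lookarounds.
    """
    windows = []
    for m in re.finditer(rf"\b{re.escape(keyword)}\b", text, flags=re.I):
        start = max(0, m.start() - width)  # clamp at the start of the string
        windows.append(text[start:m.end() + width])  # slicing clamps at the end
    return windows

windows = context_windows(text, "bankruptcy")
print(lookbehind_ok)  # False
print(len(windows))   # 2
```

Because the sample text is shorter than 100 characters, both windows here cover the entire string, which matches the expected output in the question.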


python

regex

overlap

findall

regex-lookarounds