Extract words surrounding a RegEx match using re.findall when there exists an overlapping index
The goal is to extract up to 100 characters before and after the keyword "bankruptcy".
import re
# named text rather than str so the built-in str type is not shadowed
text = "The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s."
pattern = r"(?i)\s*(?:\w|\W){0,100}\b(?:bankruptcy)\b\s*(?:\w|\W){0,100}"
output = re.findall(pattern, text)
Expected output:
['The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.',
'The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.']
Current output: ['The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.']
Is there a way to handle the overlapping indices with re.findall?
You can use the following solution based on the PyPI regex module (install it with pip install regex):
import regex
text = "The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s."
pattern = r"\b(?<=(.{0,100}))(bankruptcy)\b(?=(.{0,100}))"
print( [f"{x}{y}{z}" for x,y,z in regex.findall(pattern, text, flags=regex.I|regex.DOTALL)] )
# => ['The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.', 'The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s.']
See the Python demo online. Regex details:
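\b - a word boundary
(?<=(.{0,100})) - a positive lookbehind that matches a position immediately preceded by any 0 to 100 characters, which are captured into Group 1 (note that regex.DOTALL lets . match any character, including line breaks)
(bankruptcy) - Group 2: bankruptcy (matched case-insensitively because of the regex.I flag)
\b - a word boundary
(?=(.{0,100})) - a positive lookahead that matches a position immediately followed by any 0 to 100 characters, which are captured into Group 3.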
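Since lookbehinds and lookaheads do not consume the text they match, every match can still see all the characters to the left and right of the search word, even when the context windows overlap.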
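To illustrate the non-consuming behavior on a smaller scale, here is a minimal sketch that uses only the standard re module. The sample string and the two-character window are made up for the illustration and are not part of the solution above. Because the capturing group sits inside a lookahead, each match is empty and the scan advances one position at a time, so the captured windows are allowed to overlap.

```python
import re

sample = "abcd"  # hypothetical sample text, chosen only for the demonstration

# A plain pattern consumes the characters it matches, so matches cannot overlap.
consuming = re.findall(r"\w\w", sample)

# With the group inside a lookahead, the match itself is zero-width,
# so the next attempt starts one character later and windows overlap.
overlapping = re.findall(r"(?=(\w\w))", sample)

print(consuming)
print(overlapping)
```

The first call yields the two disjoint pairs, while the second also yields the pair that straddles them. The same idea is what lets both bankruptcy matches capture the full sentence.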
Note that re cannot be used with this pattern because it does not allow variable-width patterns inside a lookbehind.
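If installing regex is not an option, one possible workaround is to let re.finditer locate each keyword and then slice the context out of the string by index. This is a different technique from the lookaround pattern above, sketched under the assumption that plain character counts are what you want. The helper name context_windows and its width parameter are hypothetical.

```python
import re

def context_windows(text, keyword, width=100):
    """Return each keyword hit together with up to `width` characters on either side."""
    # re.escape guards against regex metacharacters in the keyword
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    windows = []
    for m in pattern.finditer(text):
        start = max(0, m.start() - width)  # clamp so the slice never starts before index 0
        end = m.end() + width              # slicing past the end of a string is safe
        windows.append(text[start:end])
    return windows

text = "The company announced bankruptcy on jan 1, 1900. Many more companies announced bankruptcy in 1920s."
result = context_windows(text, "bankruptcy")
print(result)
```

Only the keyword itself is matched here, so the matches never overlap. The surrounding text is taken by slicing, which is free to overlap between hits.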