Get repeating content by regex

Question

I have content in the following format:

text = """Pos no
...
... 25/gm
The Text to be 
...
excluded
Pos no
...
... 46 kg
The Text to be 
...
excluded
Pos no
...
... 46 xunit
End of My Text"""

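where:

Pos no ... 25/gm - this is a table-like structure from which I have to extract the values.

The Text to be ... excluded - this has a constant start (say, The Text to be) but no definite end, i.e. excluded may not be present.

End of My Text - this text will always be present.

I want a list that contains only the table contents, i.e.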

["Pos no
...
... 25/gm",
"Pos no
...
... 46 kg",
"Pos no
...
... 46 xunit"]

This is my attempt, but it does not return the correct list:

re.findall(r'(Pos no .+?)(?: |The Text to be|End of My Text)', text, re.DOTALL | re.M)

Answer 1

You can use

re.findall(r'(?sm)(Pos no\r?\n.+?)[\r\n]+(?:The Text to be|End of My Text)', text)

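Note that Pos no is not followed by a space in your text, but your pattern requires one. Also, it is safer to match the right-hand context only at the start of a line, which is what the preceding [\r\n]+ enforces here.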
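To illustrate both points, here is a minimal sketch on a shortened sample (the `sample` string and the variable names are assumptions made for this illustration, not part of the question). It shows that the original attempt finds nothing because of the space after Pos no, and that the bare space alternative would cut the capture short even once the line break is fixed:

```python
import re

# Shortened sample (assumed for illustration): "Pos no" is followed
# directly by a line break, never by a space.
sample = "Pos no\n...\n... 25/gm\nThe Text to be\n...\nexcluded\nEnd of My Text"

flags = re.DOTALL | re.M

# Original attempt: the literal space after "Pos no" never matches.
attempt = re.findall(r'(Pos no .+?)(?: |The Text to be|End of My Text)', sample, flags)
print(attempt)  # []

# Line break fixed, but the " " alternative still ends the lazy match
# at the first space inside the table.
truncated = re.findall(r'(Pos no\r?\n.+?)(?: |The Text to be|End of My Text)', sample, flags)
print(truncated)  # ['Pos no\n...\n...']
```

Requiring a line break before the terminating text avoids stopping in the middle of a table row.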

Pattern details

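(?sm) - re.DOTALL and re.MULTILINE inline modifiers (used to keep the code shorter)
(Pos no\r?\n.+?) - Group 1 (this is what re.findall returns):
- Pos no - a literal substring
- \r?\n - a CRLF or LF line break
- .+? - any 1+ characters, as few as possible, up to the leftmost occurrence of the subsequent subpatterns
[\r\n]+ - 1+ line break characters
(?:The Text to be|End of My Text) - either of the two substrings, The Text to be or End of My Text.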
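As a quick check, the pattern can be run against the sample from the question. This is a sketch that assumes LF line endings and no trailing whitespace in the sample; the names `pattern` and `tables` are chosen here for illustration:

```python
import re

# Sample text from the question, closed with the missing triple quote.
text = """Pos no
...
... 25/gm
The Text to be
...
excluded
Pos no
...
... 46 kg
The Text to be
...
excluded
Pos no
...
... 46 xunit
End of My Text"""

# Group 1 captures each table; the non-capturing group only marks where it ends.
pattern = re.compile(r'(?sm)(Pos no\r?\n.+?)[\r\n]+(?:The Text to be|End of My Text)')

tables = pattern.findall(text)
for block in tables:
    print(repr(block))
# 'Pos no\n...\n... 25/gm'
# 'Pos no\n...\n... 46 kg'
# 'Pos no\n...\n... 46 xunit'
```

Because the pattern has exactly one capturing group, re.findall returns a list of the Group 1 strings rather than the full matches.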
