Regex in Python: extract a multiline part from a text with repeating similar editions

Question

Thanks in advance for your help. I am using Python regular expressions to extract a section from a text with the following layout:

(A lot of information)

time:    150

C-FXY

-- information ---

E-END

(A lot of information)

time:   5000

C-FXY

**--- INFORMATION I WANT TO EXTRACT ---**

E-END

(A lot of information)

time:  13000

C-FXY

-- information ---

E-END

(A lot of information)

I need to extract everything between C-FXY and E-END for the time step 5000. To do that, I use the following Python 3.6 statements:

time_step = '5000'
text_part = re.search(r'time.*'+time_step+'.*C-FXY(.*?)E-END', text, re.DOTALL).group(1)

Unfortunately, the output I get is the section between C-FXY and E-END from time step 13000 of the text, not from the time step I want: 5000.

Any help would be greatly appreciated. :)

Answer 1

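The problem is that your regular expression contains a greedy .* between the time part and the C-FXY part, so it consumes everything up to the last C-FXY in the text.

A non-greedy version should be enough here: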

text_part = re.search(r'time.*'+time_step+'.*?C-FXY(.*?)E-END', text, re.DOTALL).group(1)
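To make the difference concrete, here is a minimal sketch that runs both patterns on a shortened stand-in for the question's text. The sample text, the block contents and the stricter third pattern are illustrative assumptions, not part of the original post. The stricter pattern assumes each time step sits on its own line in the form "time: <number>", and uses re.escape plus line anchors so that 5000 cannot match inside a longer number such as 15000.

```python
import re

# Shortened stand-in for the question's file (contents are made up).
sample = """\
time:    150
C-FXY
first block
E-END
time:   5000
C-FXY
wanted block
E-END
time:  13000
C-FXY
last block
E-END
"""

time_step = "5000"

# Greedy ".*" before C-FXY runs to the LAST C-FXY in the text.
greedy = re.search(r"time.*" + time_step + r".*C-FXY(.*?)E-END", sample, re.DOTALL)
greedy_result = greedy.group(1).strip()

# Lazy ".*?" stops at the FIRST C-FXY after the time step.
lazy = re.search(r"time.*" + time_step + r".*?C-FXY(.*?)E-END", sample, re.DOTALL)
lazy_result = lazy.group(1).strip()

# Stricter variant (assumption: "time: <number>" occupies a whole line).
strict = re.compile(
    r"^time:\s*" + re.escape(time_step) + r"\s*$.*?C-FXY(.*?)E-END",
    re.DOTALL | re.MULTILINE,
)
strict_result = strict.search(sample).group(1).strip()

print(greedy_result)  # last block
print(lazy_result)    # wanted block
print(strict_result)  # wanted block
```

Note that the first .* in the greedy and lazy patterns is still greedy, so they only behave as intended when the time step value appears exactly once in the text.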

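That said, I would not run a multiline search over the whole file here. Instead, I would read the file line by line until time: 5000, then continue to C-FXY, store everything from there up to E-END, and stop processing at that point.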
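The line-by-line approach could look roughly like the sketch below. The function name extract_block, its parameters and the sample text are hypothetical, and the sketch assumes that the time step and both markers each sit on their own line. It works on any iterable of lines, so an open file object can be passed in place of the io.StringIO used here.

```python
import io

def extract_block(lines, time_step, start="C-FXY", end="E-END"):
    """Return the lines between `start` and `end` following `time: <time_step>`.

    Returns None if the time step or the markers are not found.
    """
    state = "seek_time"
    collected = []
    for raw in lines:
        line = raw.rstrip("\n")
        stripped = line.strip()
        if state == "seek_time":
            # Accept "time:" followed by any amount of whitespace and the value.
            key, sep, value = stripped.partition(":")
            if sep and key.strip() == "time" and value.strip() == time_step:
                state = "seek_start"
        elif state == "seek_start":
            if stripped == start:
                state = "collect"
        elif state == "collect":
            if stripped == end:
                return collected  # stop reading as soon as the block ends
            collected.append(line)
    return None

# Made-up sample in the same layout as the question's file.
sample = """\
time:    150
C-FXY
first block
E-END
time:   5000
C-FXY
wanted block
E-END
time:  13000
C-FXY
last block
E-END
"""

print(extract_block(io.StringIO(sample), "5000"))  # ['wanted block']
```

Because the function returns at E-END, the rest of the file is never read, which matters when the input is large.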

Answer 2

You can solve this with the following code, which collects every block between C-FXY and E-END in the text:

import re

text = """(A lot of information)

time:    150

C-FXY

-- information ---

E-END

(A lot of information)

time:   5000

C-FXY

**--- INFORMATION I WANT TO EXTRACT ---**

E-END

(A lot of information)

time:  13000

C-FXY

-- information ---

E-END

(A lot of information)"""

# Compile once; re.DOTALL lets "." match newlines so a block can span several lines.
pattern = re.compile(r"C-FXY(.*?)E-END", re.DOTALL)

results = pattern.findall(text)

Now, if you print results:

for i, r in enumerate(results):
    print(f"Result {i}:\n'{r}'")

The output will be:

Result 0:
'

-- information ---

'
Result 1:
'

**--- INFORMATION I WANT TO EXTRACT ---**

'
Result 2:
'

-- information ---

'
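The list above is indexed by position, not by time step. One possible way to pick the block for a given time step, sketched here under the assumption that every "time: <number>" line is followed by exactly one C-FXY ... E-END block, is to capture the time value together with the block and build a dictionary. The names BLOCK_RE and blocks_by_time and the sample text are hypothetical.

```python
import re

# Group 1: the time step value; group 2: the text between the markers.
BLOCK_RE = re.compile(
    r"^time:\s*(\d+)\s*$.*?C-FXY(.*?)E-END",
    re.DOTALL | re.MULTILINE,
)

def blocks_by_time(text):
    """Map each time step (as int) to its stripped C-FXY ... E-END block."""
    return {int(step): body.strip() for step, body in BLOCK_RE.findall(text)}

# Made-up sample in the same layout as the question's file.
sample = """\
time:    150
C-FXY
first block
E-END
time:   5000
C-FXY
wanted block
E-END
time:  13000
C-FXY
last block
E-END
"""

blocks = blocks_by_time(sample)
print(blocks[5000])  # wanted block
```

If a time step could appear without its own block, this pattern would pair it with the next block in the text, so that case would need an extra check.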
