How to extract a certain paragraph from a file using regex in Python?

My question is how to extract a certain paragraph (typically a middle one) from a file using regex in Python.

An example file is as follows:

poem = """The time will come
when, with elation,
you will greet yourself arriving
at your own door, in your own mirror,
and each will smile at the other's welcome,
and say, sit here. Eat.
You will love again the stranger who was your self.
Give wine. Give bread. Give back your heart
to itself, to the stranger who has loved you

all your life, whom you ignored
for another, who knows you by heart.
Take down the love letters from the bookshelf,

the photographs, the desperate notes,
peel your own image from the mirror.
Sit. Feast on your life."""

How can I extract the second paragraph of this poem (i.e. "all your life ... the bookshelf,") using regex in Python?

Use a positive lookbehind and a positive lookahead:

(?<=\n\n).+(?=\n\n)

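The (?<=\n\n) at the start is a lookbehind. It only matches what comes after it if \n\n comes right before.

The final (?=\n\n) is a lookahead. It only matches what comes before it if \n\n follows right after.

Try it: https://regex101.com/r/7XnDjS/1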
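In Python, "." does not match a newline by default, so the lookaround pattern above needs re.DOTALL to cover a paragraph that spans several lines. A minimal sketch follows; the sample text and variable names are made up for illustration, and the lazy ".+?" is an assumption added here so the match stops at the first blank line when the text has more than three paragraphs.

```python
import re

# Hypothetical sample text: three paragraphs separated by blank lines.
text = "first line\nstill first\n\nsecond line\nstill second\n\nthird line"

# re.DOTALL lets "." cross the single newlines inside a paragraph.
# The lazy ".+?" stops at the first blank line rather than the last one.
match = re.search(r"(?<=\n\n).+?(?=\n\n)", text, re.DOTALL)

# re.search returns None when nothing matches, so guard before using it.
second_paragraph = match.group(0) if match else None
print(second_paragraph)
```

The same call should work on the poem variable from the question in place of text.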

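It may matter that some Windows text files end each line with \r\n rather than just \n.

Python has excellent documentation on regular expressions. Just google "python regexp". You can even google "perl regexp", because Python copied its regular expressions from Perl ;-)

One way to get just the second paragraph is to use () to capture the text between two sets of two or more line endings, like this: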

import re
myPattern = re.compile(r'[^\r\n]+\r?\n\r?\n+([^\r\n]+)\r?\n\r?\n.*')

Then use it like this:

secondPara = myPattern.sub(r"\1", content)

Here is my script being run:

schumack@linux2 137> ./poem2.py
secondPara: all your life, whom you ignored for another, who knows you by heart. Take down the love letters from the bookshelf,
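Note that [^\r\n]+ cannot cross a line break, so the pattern above expects each paragraph to sit on a single line, as in the output shown. For text like the poem in the question, where a paragraph spans several lines, something along these lines might work instead. This is only a sketch: the names BLANK_LINE, PARAGRAPH_AFTER_FIRST and second_paragraph, and the sample strings, are hypothetical and not from the answer above.

```python
import re

# A blank line is two or more consecutive line endings, each either \n or \r\n.
BLANK_LINE = r"(?:\r?\n){2,}"

# Capture lazily whatever lies between the first two blank-line separators.
# re.DOTALL lets ".*?" span the single line breaks inside the paragraph.
PARAGRAPH_AFTER_FIRST = re.compile(BLANK_LINE + r"(.*?)" + BLANK_LINE, re.DOTALL)


def second_paragraph(content):
    """Return the text between the first two blank lines, or None."""
    match = PARAGRAPH_AFTER_FIRST.search(content)
    return match.group(1) if match else None


# Hypothetical sample inputs, one with Unix and one with Windows line endings.
unix_text = "a1\na2\n\nb1\nb2\n\nc1"
windows_text = unix_text.replace("\n", "\r\n")

print(repr(second_paragraph(unix_text)))
print(repr(second_paragraph(windows_text)))
```

The captured text keeps its original line endings, so join the lines yourself if you want the paragraph on one line.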

Use a capture group and give this a try:

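import re

# Start at a line beginning with "all" and run through "bookshelf" plus the comma.
# re.MULTILINE lets ^ match at each line start; re.DOTALL lets . cross newlines.
pattern = r'^(all.*bookshelf[,\s])'

second = re.search(pattern, poem, re.MULTILINE | re.DOTALL)
print(second.group(0))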