Get paragraph after a certain symbol in Python

Question

I'm a Python beginner.

I have a large txt file in the following format, made up of many one-sentence paragraphs:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

****
Sed id placerat magna.

*******
Pellentesque in ex ac urna tincidunt tristique. 

Etiam dapibus faucibus gravida.

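I am trying to output only the paragraphs that come right after an asterisk paragraph [each asterisk paragraph has at least 4 asterisks].

The output I need: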

Sed id placerat magna.

Pellentesque in ex ac urna tincidunt tristique.

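I'm trying something like this, but I don't know A] how to require a minimum of 4 asterisks per asterisk paragraph and B] how to get the paragraph after the asterisks.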

import re

article_content = [open('text.txt').read() ]

after_asterisk_article_paragraph = []
 
string = "****"
after_asterisk_article_paragraph = string[string.find("****")+4:]

print(*after_asterisk_article_paragraph, sep='\n\n')

Once again, I'm only just starting out with Python, so please bear with me.

Answer 1

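You could read the whole file and use a pattern that matches a line of at least 4 asterisks, followed by all the lines that are not empty and do not start with 4 asterisks.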

^\*{4,}((?:\r?\n(?!\s*$|\*{4}).+)*)

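^\*{4,} Match 4 or more * at the start of a line (^ matches at every line start because of re.MULTILINE)
( Capture group 1
- (?: Non-capturing group
  - \r?\n Match a newline
  - (?!\s*$|\*{4}).+ Match the whole line, with the negative lookahead (?! asserting that it is not blank and does not start with 4 asterisks
- )* Close the non-capturing group and optionally repeat it
) Close capture group 1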
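
To see the breakdown in action, here is a minimal sketch of the same pattern written with re.VERBOSE, so each part carries its own comment. The SAMPLE string is an assumption: it stands in for the contents of text.txt and copies the example from the question.

```python
import re

# Same pattern as above, spread out with re.VERBOSE so each part is commented.
PATTERN = re.compile(
    r"""
    ^\*{4,}                  # a line starting with 4 or more asterisks
    (                        # capture group 1
      (?:                    # non-capturing group
        \r?\n                # a newline
        (?!\s*$|\*{4})       # the next line must not be blank or start with 4 asterisks
        .+                   # the rest of that line
      )*                     # repeat for every following line that qualifies
    )                        # close capture group 1
    """,
    re.MULTILINE | re.VERBOSE,
)

# Hypothetical inline sample standing in for the contents of text.txt.
SAMPLE = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n"
    "\n"
    "****\n"
    "Sed id placerat magna.\n"
    "\n"
    "*******\n"
    "Pellentesque in ex ac urna tincidunt tristique. \n"
    "\n"
    "Etiam dapibus faucibus gravida.\n"
)

# findall returns the group 1 values; strip() drops the leading newline
# and any trailing spaces.
paragraphs = [m.strip() for m in PATTERN.findall(SAMPLE)]
print(paragraphs)
```

Both the 4-asterisk line and the 7-asterisk line count as markers, because {4,} only sets a minimum.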


For example, re.findall returns the values of capture group 1:

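import re

# Read the whole file so the pattern can match across line breaks.
with open('text.txt', mode='r') as file:
    text = file.read()

# re.MULTILINE makes ^ match at the start of every line.
pattern = r'^\*{4,}((?:\r?\n(?!\s*$|\*{4}).+)*)'
result = [s.strip() for s in re.findall(pattern, text, re.MULTILINE)]
print(result)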

Output

['Sed id placerat magna.', 'Pellentesque in ex ac urna tincidunt tristique.']
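
Since the question mentions a large file, a line-by-line variant might also be worth considering, because it never needs the whole text in memory. This is only a sketch of an alternative, not the regex approach above. The function name paragraphs_after_markers is made up for illustration, and it assumes a marker line consists of nothing but 4 or more asterisks.

```python
import re
from typing import Iterable, Iterator, List

# Assumption: a marker line contains only asterisks, at least 4 of them.
MARKER = re.compile(r"\*{4,}")


def paragraphs_after_markers(lines: Iterable[str]) -> Iterator[str]:
    """Yield each paragraph that directly follows a marker line."""
    collecting = False        # True while reading a paragraph after a marker
    current: List[str] = []   # lines of the paragraph collected so far
    for raw in lines:
        line = raw.strip()
        if MARKER.fullmatch(line):
            # A new marker closes any paragraph still in progress.
            if current:
                yield "\n".join(current)
            current = []
            collecting = True
        elif not line:
            # A blank line ends the paragraph.
            if current:
                yield "\n".join(current)
            current = []
            collecting = False
        elif collecting:
            current.append(line)
    # Flush a paragraph that runs to the end of the input.
    if current:
        yield "\n".join(current)


# Hypothetical sample lines; an open file object can be passed instead,
# e.g. paragraphs_after_markers(open('text.txt')).
sample_lines = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n",
    "\n",
    "****\n",
    "Sed id placerat magna.\n",
    "\n",
    "*******\n",
    "Pellentesque in ex ac urna tincidunt tristique. \n",
    "\n",
    "Etiam dapibus faucibus gravida.\n",
]
print(list(paragraphs_after_markers(sample_lines)))
```

One difference from the regex version: a marker followed straight away by a blank line yields nothing here, whereas re.findall would return an empty string for it.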
