How to split a txt file into multiple files excluding lines with certain content

Question

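I have a large .txt file that I want to split into multiple smaller .txt files, so that each smaller file is left with a readable paragraph.

However, I also want to exclude certain parts of the source file from being written to the smaller files (i.e. if a line does not start with <p>, do not write it to a file).

Here is my code. It works fine, except that it generates some files I don't want: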

import mmap
import re

filenumber = 0

out_file = None

with open('main.txt') as x:
    for line in x:
        if line.strip() == '<p>':
            filenumber += 1
            out_file = open('narrative%03d.txt' % filenumber, 'w')
        elif line.strip().startswith('</p>') and out_file:
            out_file.close()
            out_file = None
        elif out_file:
            out_file.write(line)
if out_file:
    out_file.close()

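What I'm looking for is a way to say: run the code, but if a line does not start with <p>, do nothing and carry on with the rest of the code.

Any help would be appreciated! If I haven't provided enough information, please let me know!

Since the source file contains html tags, the easiest way for me to show you the source file is to give you a link to it: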

https://archive.org/stream/warandpeace030164mbp/warandpeace030164mbp_djvu.txt

View the page source to see the parts I don't want to include.

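I only want the paragraphs from the book,

i.e.

His daughter, Princess He*lene, passed be- tween the chairs, lightly holding up the folds of her dress, and the smile shone still more radiantly on her beautiful face. Pierre gazed at her with rapturous, almost frightened, eyes as she passed him.

"Very lovely," said Prince Andrew.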

I don't want the beginning of the document, which contains all the html, the list of chapters, and so on.

Answer 1

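For the link you provided, the entire text is contained within one giant <pre>...</pre> block, so you can extract it easily with BeautifulSoup.

First grab the html using something like requests, extract the text contained in the single pre using BeautifulSoup, then split the text on double newlines and remove any empty entries: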

from bs4 import BeautifulSoup
import requests

html = requests.get('https://archive.org/stream/warandpeace030164mbp/warandpeace030164mbp_djvu.txt')
soup = BeautifulSoup(html.text, "lxml")
war_and_peace = soup.pre.get_text()

paragraphs = war_and_peace.split('\n\n')
paragraphs[:] = [p for p in paragraphs if len(p)]       # Remove empty entries

print(paragraphs[671])   # print() call so the script also runs on Python 3

The result is a list of paragraphs. The script would display the following:

His daughter, Princess He*lene, passed be- 
tween the chairs, lightly holding up the folds 
of her dress, and the smile shone still more 
radiantly on her beautiful face. Pierre gazed 
at her with rapturous, almost frightened, eyes 
as she passed him.
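
To get from that list back to the numbered files the question asks for, each paragraph can be written to its own file. The following is only a rough sketch, not part of the original answer: the helper names split_paragraphs and write_paragraphs, the out_dir and min_length parameters, and the idea of skipping very short entries are all assumptions made for illustration. The file naming follows the narrative%03d.txt pattern from the question.

```python
import os
import re


def split_paragraphs(text):
    """Split text on blank lines and drop empty or whitespace-only entries."""
    return [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]


def write_paragraphs(paragraphs, out_dir='.', min_length=0):
    """Write each paragraph to its own numbered file; return the paths written."""
    paths = []
    for paragraph in paragraphs:
        if len(paragraph) < min_length:
            continue  # hypothetical filter: skip short fragments such as page numbers
        # Number files by how many have been written so far, so there are no gaps.
        path = os.path.join(out_dir, 'narrative%03d.txt' % (len(paths) + 1))
        with open(path, 'w', encoding='utf-8') as out_file:
            out_file.write(paragraph + '\n')
        paths.append(path)
    return paths
```

With the war_and_peace string from the script above, this could be called as write_paragraphs(split_paragraphs(war_and_peace), min_length=40). The threshold of 40 is an arbitrary guess and would need tuning against the actual text, since headings and the chapter list may still pass such a simple length check.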

html

python

regex

beautifulsoup

startswith