Creating multiple txt files from a text file

Question

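I'm trying to take the Federalist Papers from Project Gutenberg and convert them into text documents. The problem with Project Gutenberg is that the papers aren't separated: the whole thing reads in as one big text file, so I have to tell Python to create a new text file for each Federalist Paper (each one is contained between the phrases "FEDERALIST No. _" and "PUBLIUS").

My code mostly works, but the problem I'm running into is with the first text file it creates (named 1.txt by my code). When I open this file, it contains the entire raw text I subset from Project Gutenberg, not just the text of Federalist 1. File 2.txt then has only the contents of Federalist 1. The text is cut correctly; it is just offset by 1 from the file it should be in.

I suspect my problem is somewhere in the for loop, possibly related to how I initialize the variables, but I can't see what is causing the error.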

# Importing the doc and creating individual txt files for each federalist paper

from urllib import request

url = "https://www.gutenberg.org/files/1404/1404.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')

# finding the start and end of the portion of the doc we care about and subsetting
raw.find("FEDERALIST No. 1")
raw.rfind("PUBLIUS")
raw = raw[821:1167459]
# fixing the doc again... yeah this ain't clean but it's right
raw = raw[0:1166638]
# save as txt to work with below
print(raw, file=open("all.txt", "a"))

# looping over the whole text to break it into individual text docs by each
# federalist paper
with open("all.txt") as fo:
    op = ''
    start = 0
    cntr = 1
    paper = 1
    for x in fo.read().split("\n"):  # looping over the text by each line split
        if x == 'FEDERALIST No. ' + str(paper):  # creating new txt if we
                                                 # encounter a new fed paper
            if start == 1:
                with open(str(cntr) + '.txt', 'w') as opf:
                    opf.write(op)
                    opf.close()
                    op = ''
                    cntr += 1
                    paper += 1
            else:
                start = 1
        else:
            if op == '':
                op = x
            else:
                op = op + '\n' + x
    fo.close()

Answer 1

You can use the re module to split the text:

import re
import requests


url = "https://www.gutenberg.org/files/1404/1404.txt"
text = requests.get(url).text

r = re.compile(
    r"^(FEDERALIST No\..*?)(?=^PUBLIUS|^FEDERALIST)", flags=re.M | re.S
)
for i, section in enumerate(r.findall(text), 1):
    with open("{}.txt".format(i), "w") as f_out:
        f_out.write(section)

This will create 85 .txt files, each containing one of the papers.
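To see what the pattern captures without downloading anything, here is a minimal sketch that applies the same regular expression to a small made-up sample. The names `split_papers`, `write_papers` and `SAMPLE` are hypothetical helpers introduced for illustration, and the sample text is invented, not taken from the Gutenberg file. Each match starts at a "FEDERALIST No." header and stops just before the next line that begins with "PUBLIUS" or "FEDERALIST", and the file number comes from `enumerate` rather than from a counter that has to stay in sync with the headers.

```python
import re
import tempfile
from pathlib import Path

# Same pattern as in the answer: re.M lets ^ match at the start of every line,
# re.S lets . match newlines so one match can span a whole paper.
PAPER_RE = re.compile(
    r"^(FEDERALIST No\..*?)(?=^PUBLIUS|^FEDERALIST)", flags=re.M | re.S
)


def split_papers(text):
    """Return the text of each paper, header included, signature excluded."""
    return PAPER_RE.findall(text)


def write_papers(sections, out_dir):
    """Write each section to <out_dir>/<n>.txt, numbering from 1."""
    paths = []
    for i, section in enumerate(sections, 1):
        path = Path(out_dir) / f"{i}.txt"
        path.write_text(section, encoding="utf-8")
        paths.append(path)
    return paths


# Invented sample that only mimics the header/signature layout (an assumption).
SAMPLE = (
    "Front matter\n"
    "FEDERALIST No. 1\n"
    "\n"
    "First essay body.\n"
    "\n"
    "PUBLIUS\n"
    "\n"
    "FEDERALIST No. 2\n"
    "\n"
    "Second essay body.\n"
    "\n"
    "PUBLIUS\n"
)

sections = split_papers(SAMPLE)

# Write into a temporary directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    written = [p.name for p in write_papers(sections, tmp)]

print(len(sections), written)
```

Because the files are opened for writing rather than appending, running this repeatedly overwrites the previous output instead of accumulating duplicate copies of the text.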
