I want to parse multiple HTML documents with Beautiful Soup but I can't make it work

Question

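Is there a way to parse multiple HTML documents with Beautiful Soup at once? I am modifying code I found online that uses Beautiful Soup to extract the HTML .txt files from EDGAR so that they can be downloaded as formatted files. However, my code now prints only one EDGAR document (it is supposed to print 5), and I don't know what is wrong.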

import csv
import requests
import re
from bs4 import BeautifulSoup 

with open('General Motors Co 11-15.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        fn1 = line[0]
        fn2 = re.sub(r'[/\\]', '', line[1])
        fn3 = re.sub(r'[/\\]', '', line[2])
        fn4 = line[3]
        saveas = '-'.join([fn1, fn2, fn3, fn4])
        # Reorganize to rename the output filename.
        url = 'https://www.sec.gov/Archives/' + line[4].strip()
        bodytext=requests.get(url).text 
        parsedContent=BeautifulSoup(bodytext, 'html.parser')
        for script in parsedContent(["script", "style"]): 
            script.extract()
        text = parsedContent.get_text()
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        text = '\n'.join(chunk for chunk in chunks if chunk) 
        with open(saveas, 'wb') as f:
            f.write(requests.get('%s' % text).content)
            print(file, 'downloaded and wrote to text file')

Do you know what is wrong with my code?

Answer 1

My guess is that you are overwriting the existing document every time you write to the file. Try changing with open(saveas, 'wb') as f: to with open(saveas, 'ab') as f:
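As a minimal sketch of that change, the write step could look like the following. It assumes the extracted text is already in a string and is written to disk directly, rather than being passed to requests.get a second time as in the question; append_document and the demo file name are hypothetical and not part of the original code.

```python
import os
import tempfile


def append_document(path, text):
    """Append one document's extracted text to `path` (hypothetical helper)."""
    # 'ab' appends instead of truncating, so earlier documents are kept.
    with open(path, 'ab') as f:
        f.write(text.encode('utf-8'))
        f.write(b'\n')


# Usage: two documents written under the same name both survive.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
append_document(demo_path, 'first filing')
append_document(demo_path, 'second filing')
with open(demo_path, 'rb') as f:
    print(f.read())
```

Appending only matters if several rows of the CSV produce the same saveas value; if every row yields a distinct name, each document already ends up in its own file.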

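Opening the file in 'wb' mode creates a new file named saveas and truncates any existing file with that name, so each write wipes out what was written before.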
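The difference between the two modes can be checked with a small, self-contained experiment. This is only an illustration of how open() behaves; write_twice and the temporary file are made up for the demo and do not touch the EDGAR code.

```python
import os
import tempfile


def write_twice(mode):
    """Write two chunks to one file, reopening it with `mode` each time."""
    path = os.path.join(tempfile.mkdtemp(), 'out.txt')
    for chunk in (b'first\n', b'second\n'):
        with open(path, mode) as f:
            f.write(chunk)
    with open(path, 'rb') as f:
        return f.read()


print(write_twice('wb'))  # only the last chunk remains
print(write_twice('ab'))  # both chunks are kept
```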

beautifulsoup

nltk

mining

edgar