I want to parse multiple HTML documents with Beautiful Soup but I can't make it work
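Is there a way to parse multiple HTML documents at once with Beautiful Soup? I am modifying code I found online that extracts HTML .txt files from EDGAR with Beautiful Soup so that they can be downloaded as formatted files. However, my code now prints only one EDGAR document (it is intended to print 5), and I don't know what's wrong with it.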
import csv
import re

import requests
from bs4 import BeautifulSoup

with open('General Motors Co 11-15.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        # Reorganize to rename the output filename.
        # Slashes and backslashes are stripped from the middle two fields.
        fn1 = line[0]
        fn2 = re.sub(r'[/\\]', '', line[1])
        fn3 = re.sub(r'[/\\]', '', line[2])
        fn4 = line[3]
        saveas = '-'.join([fn1, fn2, fn3, fn4])
        url = 'https://www.sec.gov/Archives/' + line[4].strip()
        bodytext = requests.get(url).text
        parsedContent = BeautifulSoup(bodytext, 'html.parser')
        # Drop script and style elements before extracting the text.
        for script in parsedContent(["script", "style"]):
            script.extract()
        text = parsedContent.get_text()
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
        text = '\n'.join(chunk for chunk in chunks if chunk)
        with open(saveas, 'wb') as f:
            # text already holds the cleaned document (it is not a URL),
            # so encode and write it instead of calling requests.get() on it.
            f.write(text.encode('utf-8'))
        print(saveas, 'downloaded and wrote to text file')
Do you know what's wrong with my code?
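My guess is that you are overwriting the existing document every time you write to the file. Try changing with open(saveas, 'wb') as f: to with open(saveas, 'ab') as f:
Opening a file in wb mode creates a new file with the same name as saveas, essentially clearing the existing document.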
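To see the difference between the two modes in isolation, here is a minimal sketch. The helper write_twice and the filename example.txt are made up for illustration and are not part of the code in the question; a temporary directory stands in for wherever the output files are saved.

```python
import os
import tempfile


def write_twice(mode):
    """Write two payloads to the same path using `mode`; return the final contents."""
    # A throwaway directory stands in for the real output location (assumption).
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, 'example.txt')  # hypothetical filename
        for payload in (b'first document\n', b'second document\n'):
            # The file is reopened for every payload, as in the loop in the question.
            with open(path, mode) as f:
                f.write(payload)
        with open(path, 'rb') as f:
            return f.read()


overwritten = write_twice('wb')  # each open truncates the file first
appended = write_twice('ab')     # each open adds to the end of the file
print(overwritten)  # b'second document\n'
print(appended)     # b'first document\nsecond document\n'
```

Append mode only changes the result when several rows of the CSV produce the same saveas name. If every row yields a distinct name, each file is written once either way, and the cause is more likely elsewhere in the loop.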