Nested FOR loop for my web scraper halfway working (Beautiful Soup)
I'm trying to write a web scraping function that does a few things:
- determine the number of URLs to scrape from a list of URLs
- create a separate file for each URL
- scrape the TEXT from each URL
- insert the results of each text scrape into the corresponding file that was just created
Here is the current code:
# this is the array of URLs
urls = ['https://calevip.org/incentive-project/northern-california',
        'https://www.slocleanair.org/community/grants/altfuel.php',
        'https://www.mcecleanenergy.org/ev-charging/',
        'https://www.peninsulacleanenergy.com/ev-charging-incentives/',
        'https://www.irs.gov/businesses/plug-in-electric-vehicle-credit-irc-30-and-irc-30d',
        'https://afdc.energy.gov/laws/12309',
        'https://cleanvehiclerebate.org/eng/fleet',
        'https://calevip.org/incentive-project/san-joaquin-valley']
import requests
from bs4 import BeautifulSoup
import sys
from websites import urls

def scrape():
    for x in range(len(urls)):
        f = open("test" + str(x) + ".txt", 'w')
        for url in urls:
            page = requests.get(url)
            # this line of code creates a Beautiful Soup object that takes page.content as input
            soup = BeautifulSoup(page.content, "html.parser")
            results = soup.prettify().encode('cp1252', errors='ignore')
            # we need a command that enters the results into the file we just created.
            f.write(str(results))
So far I've been able to get the function to do steps 1 and 2. The problem is that the scraped text ends up in all 8 .txt files, instead of the first website's text going into the first .txt file, the second website's text into the second file, the third into the third... and so on.
How do I fix this? I feel like I'm close, but my second FOR loop isn't written correctly.
Try this:
import requests
from bs4 import BeautifulSoup as BS

urls = ['https://calevip.org/incentive-project/northern-california',
        'https://www.slocleanair.org/community/grants/altfuel.php',
        'https://www.mcecleanenergy.org/ev-charging/',
        'https://www.peninsulacleanenergy.com/ev-charging-incentives/',
        'https://www.irs.gov/businesses/plug-in-electric-vehicle-credit-irc-30-and-irc-30d',
        'https://afdc.energy.gov/laws/12309',
        'https://cleanvehiclerebate.org/eng/fleet',
        'https://calevip.org/incentive-project/san-joaquin-valley']

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

def scrape():
    # reuse one session for all requests
    with requests.Session() as session:
        i = 1
        for url in urls:
            try:
                page = session.get(url, headers=headers)
                page.raise_for_status()
                # write each page's prettified HTML into its own numbered file
                with open(f'test{i}.txt', 'w', encoding='utf-8') as f:
                    f.write(BS(page.text, 'lxml').prettify())
                i += 1
            except Exception as e:
                print(f'Exception while processing {url} -> {e}')

if __name__ == '__main__':
    scrape()
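The key change is that the file is now opened inside a single loop over urls, so each page is written to its own file as soon as it is fetched. In your original version the outer loop opened one file and the inner loop then wrote every page into it, which is why every file ended up with the same content.

If you want just the visible TEXT rather than the prettified HTML, here is a minimal sketch of a variant (the function name scrape_text is made up for illustration; it assumes the same urls list and headers dict shown above, and uses enumerate instead of a manual counter). get_text() is a standard BeautifulSoup method that strips the markup:

import requests
from bs4 import BeautifulSoup

def scrape_text():
    # assumes the `urls` list and `headers` dict defined above
    with requests.Session() as session:
        for i, url in enumerate(urls, start=1):
            try:
                page = session.get(url, headers=headers, timeout=30)
                page.raise_for_status()
                soup = BeautifulSoup(page.text, 'html.parser')
                # get_text() drops the tags; separator/strip keep the output readable
                text = soup.get_text(separator='\n', strip=True)
                with open(f'test{i}.txt', 'w', encoding='utf-8') as f:
                    f.write(text)
            except Exception as e:
                print(f'Exception while processing {url} -> {e}')

Note that enumerate ties each file number to the URL's position in the list, so a failed URL just leaves a gap instead of shifting the later files by one, and 'html.parser' is used here so no extra lxml install is needed.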