Nested FOR loop for my web scraper halfway working (Beautiful Soup)

I'm trying to write a web-scraping function that does a few things:

  1. Determine the number of URLs to scrape, based on a list of URLs
  2. Create a separate file for each URL
  3. Scrape the TEXT from each URL
  4. Insert the results of each text scrape into the designated file that was just created

Here is the current code:

#this is the array of URLs

urls = ['https://calevip.org/incentive-project/northern-california',
        'https://www.slocleanair.org/community/grants/altfuel.php',
        'https://www.mcecleanenergy.org/ev-charging/',
        'https://www.peninsulacleanenergy.com/ev-charging-incentives/',
        'https://www.irs.gov/businesses/plug-in-electric-vehicle-credit-irc-30-and-irc-30d',
        'https://afdc.energy.gov/laws/12309',
        'https://cleanvehiclerebate.org/eng/fleet',
        'https://calevip.org/incentive-project/san-joaquin-valley']

import requests
from bs4 import BeautifulSoup
import sys
from websites import urls

def scrape():
    for x in range (len(urls)):
        f = open("test"+str(x)+".txt", 'w')
        for url in urls:
            page = requests.get(url)
            #this line of code creates a Beautiful Soup object that takes page.content as input
            soup = BeautifulSoup(page.content, "html.parser") 
            results = (soup.prettify().encode('cp1252', errors='ignore'))
            #we need a command that enters the results into the file we just created.
            f.write(str(results))


So far I've been able to get the function to perform steps 1 and 2. The problem is that the text scrape from the first website gets put into all 8 .txt files, instead of the first website's scrape going into the first .txt file, the second website's scrape going into the second file, the third website's scrape going into the third file... and so on.

How do I fix this? I feel like I'm close, but my second FOR loop isn't written correctly.

Try this:

import requests
from bs4 import BeautifulSoup as BS


urls = ['https://calevip.org/incentive-project/northern-california',
        'https://www.slocleanair.org/community/grants/altfuel.php',
        'https://www.mcecleanenergy.org/ev-charging/',
        'https://www.peninsulacleanenergy.com/ev-charging-incentives/',
        'https://www.irs.gov/businesses/plug-in-electric-vehicle-credit-irc-30-and-irc-30d',
        'https://afdc.energy.gov/laws/12309',
        'https://cleanvehiclerebate.org/eng/fleet',
        'https://calevip.org/incentive-project/san-joaquin-valley']
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

def scrape():
    # A single loop pairs each URL with its own output file, so every
    # page's text lands in exactly one file.
    with requests.Session() as session:
        i = 1
        for url in urls:
            try:
                page = session.get(url, headers=headers)
                page.raise_for_status()  # raise an error for 4xx/5xx responses
                with open(f'test{i}.txt', 'w') as f:
                    f.write(BS(page.text, 'lxml').prettify())
                    i += 1
            except Exception as e:
                print(f'Exception while processing {url} -> {e}')

if __name__ == '__main__':
    scrape()
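
For context, the reason every file ended up with the same content is the nested loop: each pass of the outer loop opens a new file, and then the inner loop writes every page into that file. Any version that pairs one URL with one file in a single loop fixes it, and enumerate is a convenient way to do that. Below is a minimal sketch of that variant, assuming you want to keep your original test0.txt ... test7.txt filenames, the from websites import urls setup, and the built-in html.parser; the encoding="utf-8" argument is an addition that sidesteps the Windows cp1252 issue your encode('cp1252', ...) call was presumably working around.

import requests
from bs4 import BeautifulSoup

from websites import urls  # the same list of URLs shown above

def scrape():
    # enumerate gives each URL its own index, so the file number and the
    # page being scraped always advance together in a single loop.
    for i, url in enumerate(urls):
        page = requests.get(url)
        soup = BeautifulSoup(page.content, "html.parser")
        # encoding="utf-8" keeps prettify()'s Unicode output from raising
        # UnicodeEncodeError under Windows' default cp1252 codec
        with open(f"test{i}.txt", "w", encoding="utf-8") as f:
            f.write(soup.prettify())

if __name__ == "__main__":
    scrape()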