Fix BeautifulSoup code to get data from all pages and output into csv

Complete beginner here, please help. I have this code, and it worked when I had a print statement instead of the .csv output — that is, before I added the last two lines and everything related to the variable 'data'. By 'worked' I mean it printed the data from all 18 pages.

Now it outputs the data to a .csv, but only from the first page (url).

I can see that I never pass nexturl into pandas at the end — because I don't know how to do that. Any help is much appreciated.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.marketresearch.com/search/results.asp?qtype=2&datepub=3&publisher=Technavio&categoryid=0&sortby=r'


def scrape_it(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    nexturl = soup.find_all(class_="standardLinkDkBlue")[-1]['href']
    stri = soup.find_all(class_="standardLinkDkBlue")[-1].string
    reports = soup.find_all("tr", {"class": ["SearchTableRowAlt", "SearchTableRow"]})
    data = []

    for report in reports:
        data.append({
               'title': report.find('a', class_='linkTitle').text,
               'price': report.find('div', class_='resultPrice').text,
               'date_author': report.find('div', class_='textGrey').text.replace(' | published by: TechNavio', ''),
               'detail_link': report.a['href']
        })

    if 'next' not in stri:
        print("All pages completed")
    else:
        scrape_it(nexturl)

    return data

myOutput = pd.DataFrame(scrape_it(url))
myOutput.to_csv(f'results-tec6.csv', header=False)

Make data global so you keep appending to it across the recursive calls instead of recreating it on every call. Then move the call to your recursive function outside the DataFrame() call, so you can pass the finished data list to pandas.

Finally, you can pass a cookie to request the maximum number of results per page, which cuts down the number of requests.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.marketresearch.com/search/results.asp?qtype=2&datepub=3&publisher=Technavio&categoryid=0&sortby=r&page=1'

data = []  # global list shared by every recursive call

def scrape_it(url):
    # Ask the site for 100 results per request via a cookie, so fewer requests are needed
    page = requests.get(url, headers={'Cookie': 'ResultsPerPage=100'})
    soup = BeautifulSoup(page.text, 'html.parser')
    # The last "standardLinkDkBlue" link is the pager: grab its target URL and its label text
    nexturl = soup.find_all(class_="standardLinkDkBlue")[-1]['href']
    stri = soup.find_all(class_="standardLinkDkBlue")[-1].string
    reports = soup.find_all("tr", {"class": ["SearchTableRowAlt", "SearchTableRow"]})

    for report in reports:
        data.append({
            'title': report.find('a', class_='linkTitle').text,
            'price': report.find('div', class_='resultPrice').text,
            'date_author': report.find('div', class_='textGrey').text.replace(' | published by: TechNavio', ''),
            'detail_link': report.a['href']
        })

    # Stop when the pager label no longer contains "next"; otherwise recurse into the next page
    if 'next' not in stri:
        print("All pages completed")
    else:
        scrape_it(nexturl)

scrape_it(url)
myOutput = pd.DataFrame(data)
myOutput.to_csv('results-tec6.csv', header=False)
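
As a side note, if you would rather not use a global, the same idea works iteratively: follow the "next" link in a loop, collect rows into a local list, and return it. Below is a minimal sketch under the same assumptions as above (the last standardLinkDkBlue element is the pager, its text contains 'next' while more pages remain, and its href is a full URL); the function name scrape_all is just illustrative.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.marketresearch.com/search/results.asp?qtype=2&datepub=3&publisher=Technavio&categoryid=0&sortby=r&page=1'

def scrape_all(url):
    data = []  # local accumulator, returned at the end instead of shared globally
    while True:
        page = requests.get(url, headers={'Cookie': 'ResultsPerPage=100'})
        soup = BeautifulSoup(page.text, 'html.parser')
        for report in soup.find_all("tr", {"class": ["SearchTableRowAlt", "SearchTableRow"]}):
            data.append({
                'title': report.find('a', class_='linkTitle').text,
                'price': report.find('div', class_='resultPrice').text,
                'date_author': report.find('div', class_='textGrey').text.replace(' | published by: TechNavio', ''),
                'detail_link': report.a['href']
            })
        pager = soup.find_all(class_="standardLinkDkBlue")[-1]
        if 'next' not in (pager.string or ''):
            break  # no "next" label on the pager: last page reached
        url = pager['href']  # assumes the href is a full URL, as in the recursive version
    return data

myOutput = pd.DataFrame(scrape_all(url))
myOutput.to_csv('results-tec6.csv', header=False)

The loop version also sidesteps Python's default recursion limit of 1000 frames, which would only matter if the result set ever spanned many hundreds of pages.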