Next pages not working in bs4 and pandas dataframe

I am trying to get the output from all the next pages using BeautifulSoup CSS selectors and a Pandas DataFrame, but I only get the first page as output. Can you help me? Thanks.

Code:

import requests 
import bs4
import pandas as pd
base_url = 'http://quotes.toscrape.com'
res = requests.get("http://quotes.toscrape.com/") 

soup = bs4.BeautifulSoup(res.text,'lxml') 

all_quote = []
all_author = []
all_tag = []

for element in soup.select('.quote'):
    quote = element.select_one('span.text').text
    all_quote.append(quote)
    
    author = element.select_one('small.author').text
    all_author.append(author)
    
    tag = " , ".join(e.text for e in element.select("a.tag"))
    all_tag.append(tag)
    
    next_page = soup.select_one('li.next>a')
    if next_page:
        next_page = base_url + next_page['href']
    else:
        pass

df = pd.DataFrame({'all_quote': all_quote,'all_author':all_author, 'all_tag': all_tag}) 
print(df) 
    

Your loop finds the next_page link but never requests it, so only the first page is ever parsed. One approach is to create a recursive function that takes the parameters you want to scrape and updates them on each recursive call whenever there is a next page:

import requests 
import bs4
import pandas as pd

base_url = 'http://quotes.toscrape.com'

def scraper(url, all_quote=[], all_author=[], all_tag=[]):
    # Note: the mutable default lists persist across calls, so pass fresh
    # lists explicitly if you call scraper() more than once.
    res = requests.get(url)
    soup = bs4.BeautifulSoup(res.text, 'lxml')

    # Collect every quote on the current page.
    for element in soup.select('.quote'):
        quote = element.select_one('span.text').text
        all_quote.append(quote)

        author = element.select_one('small.author').text
        all_author.append(author)

        tag = " , ".join(e.text for e in element.select("a.tag"))
        all_tag.append(tag)

    # Only after the whole page has been processed, follow the "Next" link.
    next_page = soup.select_one('li.next>a')
    if next_page:
        next_page = base_url + next_page['href']
        return scraper(next_page, all_quote, all_author, all_tag)
    else:
        return [all_quote, all_author, all_tag]

data = scraper('http://quotes.toscrape.com/')

Once there is no next button, the function returns its parameters with their latest values. Now you can unpack all_quote, all_author and all_tag as follows:

all_quote, all_author, all_tag = data
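
To get back to the DataFrame the question was after, the unpacked lists can then be passed to pd.DataFrame just as in the original code:

df = pd.DataFrame({'all_quote': all_quote, 'all_author': all_author, 'all_tag': all_tag})
print(df)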

Try:

import bs4
import requests
import pandas as pd

base_url = "http://quotes.toscrape.com"

all_quote = []
all_author = []
all_tag = []

url = base_url

while True:
    print(url)
    res = requests.get(url)
    soup = bs4.BeautifulSoup(res.content, "lxml")

    # Collect every quote on the current page.
    for element in soup.select(".quote"):
        quote = element.select_one("span.text").text
        all_quote.append(quote)

        author = element.select_one("small.author").text
        all_author.append(author)

        tag = " , ".join(e.text for e in element.select("a.tag"))
        all_tag.append(tag)

    # Stop when there is no "Next" button, otherwise move on to the next page.
    next_page = soup.select_one("li.next>a")
    if next_page:
        url = base_url + next_page["href"]
    else:
        break

df = pd.DataFrame(
    {"all_quote": all_quote, "all_author": all_author, "all_tag": all_tag}
)
print(df)
df.to_csv("data.csv", sep=";", index=False)

Saved data.csv (screenshot from LibreOffice).
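
As a quick sanity check, the saved file can also be read back with pandas; a minimal sketch, assuming data.csv was written with the ";" separator as above:

import pandas as pd

# Re-read the CSV; the separator must match the one used in to_csv().
df = pd.read_csv("data.csv", sep=";")
print(df.head())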