How to scrape product information with Beautiful Soup from a page that contains an HTML table
import requests
from bs4 import BeautifulSoup
import pandas as pd

baseurl = 'https://books.toscrape.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

r = requests.get('https://books.toscrape.com/')
soup = BeautifulSoup(r.content, 'html.parser')

productlinks = []
Title = []
Brand = []

# collect the product-detail links from the listing page
tra = soup.find_all('article', class_='product_pod')
for links in tra:
    for link in links.find_all('a', href=True)[1:]:
        comp = baseurl + link['href']
        productlinks.append(comp)

# visit each product page and pull the title and price
for link in productlinks:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    try:
        title = soup.find('h3').text
    except AttributeError:
        title = ' '
    Title.append(title)
    price = soup.find('p', class_="price_color").text.replace('£', '').replace(',', '').strip()
    Brand.append(price)

# the "Price" column must come from the Brand list, not the loop
# variable `price` (which only holds the last product's value)
df = pd.DataFrame({"Title": Title, "Price": Brand})
print(df)
The script above works as expected, but I also want to scrape each product's details, such as the UPC and Product Type. For example, from a single product page like
https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
I want to grab the UPC, Product Type, etc. All of that information sits in the "Product Information" table.
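One way to read that "Product Information" table: each of its rows pairs a `<th>` label (UPC, Product Type, ...) with a `<td>` value, so the rows can be collected into a dict. The sketch below parses a hard-coded HTML fragment shaped like that table rather than making a live request, so the values shown are placeholders, not guaranteed site data:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the "Product Information" table on a product page;
# on the live site you would fetch the page with requests and pass
# r.content to BeautifulSoup instead.
html = """
<table class="table table-striped">
  <tr><th>UPC</th><td>a897fe39b1053632</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
  <tr><th>Price (excl. tax)</th><td>£51.77</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# build a {label: value} dict from the <th>/<td> pairs in each row
info = {row.find("th").text: row.find("td").text
        for row in soup.find("table", class_="table-striped").find_all("tr")}

print(info["UPC"])           # a897fe39b1053632
print(info["Product Type"])  # Books
```

`class_="table-striped"` matches because Beautiful Soup's `class_` filter matches any one of an element's CSS classes. Inside your existing per-product loop you could build this dict from each product page's soup and append `info["UPC"]`, `info["Product Type"]`, etc. to your lists.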
You can use the start= parameter in the URL to get the next page:
import requests
from bs4 import BeautifulSoup

for page in range(0, 10):  # <-- increase number of pages here
    r = requests.get(
        "https://pk.indeed.com/jobs?q=&l=Lahore&start={}".format(page * 10)
    )
    soup = BeautifulSoup(r.content, "html.parser")

    # each job card's title sits in an <h2 class="jobTitle">
    title = soup.find_all("h2", class_="jobTitle")
    for i in title:
        print(i.text)
Prints:
Data Entry Work Online
newAdmin Assistant
newNCG Agent
Data Entry Operator
newResearch Associate Electrical
Administrative Assistant (Executive Assistant)
Admin Assistant Digitally
newIT Officer (Remote Work)
OFFICE ASSISTANT
Cash Officer - Lahore Region
newDeputy Manager Finance
Admin Assistant
Lab Assistant
newProduct Portfolio & Customer Service Specialist
Front Desk Officer
newRelationship Manager, Recovery
MANAGEMENT TRAINEE PROGRAM
Email Support Executive (International)
Data Entry Operator
Admin officer
...and so on.