How to scrape product information with Beautiful Soup from a page that contains an HTML table
import requests
from bs4 import BeautifulSoup
import pandas as pd

baseurl = 'https://books.toscrape.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

r = requests.get('https://books.toscrape.com/')
soup = BeautifulSoup(r.content, 'html.parser')

productlinks = []
Title = []
Brand = []

# collect the product-detail links from the listing page
tra = soup.find_all('article', class_='product_pod')
for links in tra:
    for link in links.find_all('a', href=True)[1:]:
        comp = baseurl + link['href']
        productlinks.append(comp)

# visit each product page and pull the title and price
for link in productlinks:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    try:
        title = soup.find('h3').text
    except AttributeError:
        title = ' '
    Title.append(title)
    price = soup.find('p', class_="price_color").text.replace('£', '').replace(',', '').strip()
    Brand.append(price)

# the "Price" column must come from the Brand list, not the loop
# variable `price` (which only holds the last product's value)
df = pd.DataFrame({"Title": Title, "Price": Brand})
print(df)
The script above works as expected, but I also want to scrape each product's details, such as the UPC and Product Type. For example, from a single product page like
https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
I want to grab the UPC, Product Type, etc. All of that information sits in the "Product Information" table.
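One way to read that "Product Information" table: each of its rows pairs a `<th>` label (UPC, Product Type, ...) with a `<td>` value, so the rows can be collected into a dict. The sketch below parses a hard-coded HTML fragment shaped like that table rather than making a live request, so the values shown are placeholders, not guaranteed site data:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the "Product Information" table on a product page;
# on the live site you would fetch the page with requests and pass
# r.content to BeautifulSoup instead.
html = """
<table class="table table-striped">
  <tr><th>UPC</th><td>a897fe39b1053632</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
  <tr><th>Price (excl. tax)</th><td>£51.77</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# build a {label: value} dict from the <th>/<td> pairs in each row
info = {row.find("th").text: row.find("td").text
        for row in soup.find("table", class_="table-striped").find_all("tr")}

print(info["UPC"])           # a897fe39b1053632
print(info["Product Type"])  # Books
```

`class_="table-striped"` matches because Beautiful Soup's `class_` filter matches any one of an element's CSS classes. Inside your existing per-product loop you could build this dict from each product page's soup and append `info["UPC"]`, `info["Product Type"]`, etc. to your lists.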
You can use the start= parameter in the URL to get the next page:
import requests
from bs4 import BeautifulSoup

for page in range(0, 10):  # <-- increase number of pages here
    r = requests.get(
        "https://pk.indeed.com/jobs?q=&l=Lahore&start={}".format(page * 10)
    )
    soup = BeautifulSoup(r.content, "html.parser")

    # each job card's title sits in an <h2 class="jobTitle">
    title = soup.find_all("h2", class_="jobTitle")
    for i in title:
        print(i.text)
Prints:
Data Entry Work Online
newAdmin Assistant
newNCG Agent
Data Entry Operator
newResearch Associate Electrical
Administrative Assistant (Executive Assistant)
Admin Assistant Digitally
newIT Officer (Remote Work)
OFFICE ASSISTANT
Cash Officer - Lahore Region
newDeputy Manager Finance
Admin Assistant
Lab Assistant
newProduct Portfolio & Customer Service Specialist
Front Desk Officer
newRelationship Manager, Recovery
MANAGEMENT TRAINEE PROGRAM
Email Support Executive (International)
Data Entry Operator
Admin officer
...and so on.