Scraping - Cannot identify product class
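Good afternoon all,
I have been trying to build a scraper for this particular page.
I am trying to extract the product names and prices.
The code is below: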
from bs4 import BeautifulSoup
import requests
import pandas as pd
import urllib.parse
website = 'https://www.thewhiskyexchange.com/c/339/rum'
response = requests.get(website)
response.status_code
soup = BeautifulSoup(response.content, 'html.parser')
results = soup.find_all('li',{'product-grid__item'})
If I run len(results), I get 24.
But when I actually look at a result (results[0]), I only get 1 item back:
<li class="product-grid__item"><a class="product-card" href="/p/63818/bumbu-the-original-rum-glass-pack" onclick="_gaq.push(['_trackEvent', 'Products-GridView', 'click', '63818 : Bumbu The Original Rum / Glass Pack'])" title=" Bumbu The Original Rum Glass Pack"><div class="product-card__image-container"><img alt="Bumbu The Original Rum Glass Pack" class="product-card__image" height="4" loading="lazy" src="https://img.thewhiskyexchange.com/480/rum_bum4.jpg" width="3"/></div><div class="product-card__content"><p class="product-card__name"> Bumbu The Original Rum<span class="product-card__name-secondary">Glass Pack</span></p><p class="product-card__meta"> 70cl / 40% </p></div><div class="product-card__data"><p class="product-card__price"> £39.95 </p><p class="product-card__unit-price"> (£57.07 per litre) </p></div></a></li>
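My question is: am I looking at the right class? I tried other classes, but those did not seem to work either. Or is there something wrong with the code?
(I should say that I am teaching myself to code, so I would not be surprised if I have missed something.)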
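Everything is working correctly. results is actually a list (which means the search soup.find_all('li',{'product-grid__item'}) returned many matches), so with results[0] you are accessing only the first element of that list. You can use print(results) to see every element in results, or use a for loop: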
for result in results:
    print(result)
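To make the list behaviour concrete, here is a minimal sketch that runs against a small inline HTML snippet instead of the live page. The two-item markup and the names "Rum A" and "Rum B" are made up for illustration; only the class names come from the question. It shows that find_all returns a list-like object that you can measure, index, and loop over.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two items reusing the question's class names.
html = """
<ul>
  <li class="product-grid__item"><p class="product-card__name">Rum A</p></li>
  <li class="product-grid__item"><p class="product-card__name">Rum B</p></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
results = soup.find_all("li", class_="product-grid__item")

print(len(results))  # how many <li> elements matched
print(results[0])    # only the first match, not the whole list

# Looping (or a comprehension) visits every match in turn.
names = [item.get_text(strip=True) for item in results]
print(names)
```

So seeing a single item for results[0] is expected: the other matches are at results[1], results[2], and so on.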
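The product title is the text node that comes right after the opening of [class="product-card__name"]. So to get the value of that text node, you can call the .find(text=True) method. The price is obtained the same way. Now it is working: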
from bs4 import BeautifulSoup
import requests
import pandas as pd
import urllib.parse

website = 'https://www.thewhiskyexchange.com/c/339/rum'
response = requests.get(website)
response.status_code
soup = BeautifulSoup(response.content, 'html.parser')

# Each product card is an <li class="product-grid__item">
results = soup.find_all('li', class_='product-grid__item')
for result in results:
    # First text node of the name element (leaves out the secondary <span>)
    title = result.select_one('.product-card__name').find(text=True)
    print(title)
    try:
        price = result.select_one('.product-card__unit-price').find(text=True).replace('(', '').replace(')', '')
        print(price)
    except AttributeError:
        # select_one returns None when a product has no unit price
        pass
Output:
Bumbu The Original Rum
£57.07 per litre
Kraken Black Spiced
£54.64 per litre
Kraken Black Roast Coffee Rum
£38.21 per litre
Doorly's 14 Year Old Rum
£87.79 per litre
Admiral Vernon's Old J Spiced Tiki Fire Rum
£59.93 per litre
Ron Zacapa Centenario Sistema Solera 23 Rum
£78.50 per litre
Old Monk 7 Year Old Rum
£35.64 per litre
Diplomatico Reserva Exclusiva Rum
£64.21 per litre
Pusser's Select Aged 151 Navy Rum
£69.93 per litre
Diplomatico Reserva Exclusiva Rum
£58.50 per litre
El Dorado Rum 15 Year Old
£78.50 per litre
Plantation Extra Old Barbados Rum
£77.50 per litre
Captain Morgan Black Spiced
Doorly's XO Rum
£53.50 per litre
Mount Gay XO Triple Cask Blend
£76.79 per litre
Diplomatico Reserva Exclusiva Rum
£58.50 per litre
Plantation Barbados 5 Year Old Signature Blend Rum
£44.64 per litre
Worthy Park Single Estate Reserve
£69.93 per litre
Pusser's Blue Label British Navy Rum
£39.93 per litre
Ron Zacapa Centenario XO Rum Solera Gran Reserva Especial
£150 per litre
Havana Club 3 Year Old Rum
£30.64 per litre
Santa Teresa 1796 Rum
£74.93 per litre
Eminente Reserva 7 Year Old
£64.93 per litre
Bumbu The Original Rum
£48.21 per litre
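The question asked for product names and prices, while the loop above prints the per-litre price. As a rough sketch of one way to collect both, the example below parses the single product card quoted in the question (trimmed to the relevant tags) rather than the live page. The helper names text_or_none and parse_product are hypothetical, and the sketch uses find(string=True), which is the current spelling of find(text=True) in Beautiful Soup.

```python
from bs4 import BeautifulSoup

# One product card copied from the question, trimmed to the relevant tags.
html = """
<li class="product-grid__item"><a class="product-card" href="/p/63818/bumbu-the-original-rum-glass-pack">
<div class="product-card__content"><p class="product-card__name"> Bumbu The Original Rum<span class="product-card__name-secondary">Glass Pack</span></p>
<p class="product-card__meta"> 70cl / 40% </p></div>
<div class="product-card__data"><p class="product-card__price"> £39.95 </p>
<p class="product-card__unit-price"> (£57.07 per litre) </p></div></a></li>
"""


def text_or_none(item, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = item.select_one(selector)
    return node.get_text(" ", strip=True) if node is not None else None


def parse_product(item):
    """Hypothetical helper: collect the fields of one product card into a dict."""
    name_node = item.select_one(".product-card__name")
    # First text node only, which leaves out the secondary <span> ("Glass Pack").
    name = name_node.find(string=True).strip() if name_node is not None else None
    unit_price = text_or_none(item, ".product-card__unit-price")
    return {
        "name": name,
        "full_name": text_or_none(item, ".product-card__name"),
        "price": text_or_none(item, ".product-card__price"),
        "unit_price": unit_price.strip("()") if unit_price else None,
    }


soup = BeautifulSoup(html, "html.parser")
products = [parse_product(li) for li in soup.find_all("li", class_="product-grid__item")]
print(products)
```

Checking for None instead of wrapping the lookup in try/except keeps a product with a missing field in the results, with that field set to None. The resulting list of dicts can be passed straight to pd.DataFrame(products) if a table is wanted.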