Scraping image URLs with BeautifulSoup
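Beginner here, trying to learn something today by doing some scraping.
I am trying to list product names and their corresponding image URLs in a spreadsheet.
I managed to store the names, but the images don't seem to work. Hoping you can help!
This is the code I use to extract the text: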
results[0].find('p', {'class': 'product-card__name'}).get_text()
This is what I thought would extract the image:
results[0].find('img', {'class':'product-card__image'}).get_src()
This is obviously not working. It returns "'NoneType' object is not callable".
Any pointers?
For reference, below is the source I am trying to scrape.
<li class="product-grid__item"><a href="/p/63818/bumbu-the-original-rum-glass-pack" class="product-card" title=" Bumbu The Original Rum Glass Pack" onclick="_gaq.push(['_trackEvent', 'Products-GridView', 'click', '63818 : Bumbu The Original Rum / Glass Pack'])"><div class="product-card__image-container"><img src="https://img.thewhiskyexchange.com/480/rum_bum4.jpg" alt="Bumbu The Original Rum Glass Pack" class="product-card__image" loading="lazy" width="3" height="4"></div><div class="product-card__content"><p class="product-card__name"> Bumbu The Original Rum<span class="product-card__name-secondary">Glass Pack</span></p><p class="product-card__meta"> 70cl / 40% </p></div><div class="product-card__data"><p class="product-card__price"> £39.95 </p><p class="product-card__unit-price"> (£57.07 per litre) </p></div></a></li>
To get the image URL, call .get('src') instead of .get_src():
results[0].find('img', {'class':'product-card__image'}).get('src')
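As for why the error mentions NoneType: as far as I know, Beautiful Soup treats an unknown attribute on a tag as a lookup for a child tag with that name, so .get_src behaves like find('get_src'). The img tag has no such child, so the expression evaluates to None, and the trailing () then tries to call None. A minimal sketch that reproduces this on a shortened copy of the img tag from the question (the variable names are only for illustration):

```python
from bs4 import BeautifulSoup

# Shortened copy of the <img> tag from the question.
html = '<img src="https://img.thewhiskyexchange.com/480/rum_bum4.jpg" class="product-card__image">'
img = BeautifulSoup(html, "html.parser").find("img", {"class": "product-card__image"})

# Unknown attribute access is treated as a search for a child tag named
# "get_src"; <img> has no such child, so the result is None.
missing = img.get_src
print(missing)

# Calling None is what raises the TypeError seen in the question.
error_message = ""
try:
    img.get_src()
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# HTML attributes are read with .get() or with dictionary-style access.
print(img.get("src"))
print(img["src"])
```

.get('src') returns None when the attribute is absent, while img['src'] raises a KeyError, so .get() is usually the safer choice when some images may lack a src.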
Example:
html='''
<li class="product-grid__item">
<a class="product-card" href="/p/63818/bumbu-the-original-rum-glass-pack" onclick="_gaq.push(['_trackEvent', 'Products-GridView', 'click', '63818 : Bumbu The Original Rum / Glass Pack'])" title=" Bumbu The Original Rum Glass Pack">
<div class="product-card__image-container">
<img alt="Bumbu The Original Rum Glass Pack" class="product-card__image" height="4" loading="lazy" src="https://img.thewhiskyexchange.com/480/rum_bum4.jpg" width="3"/>
</div>
<div class="product-card__content">
<p class="product-card__name">
Bumbu The Original Rum
<span class="product-card__name-secondary">
Glass Pack
</span>
</p>
<p class="product-card__meta">
70cl / 40%
</p>
</div>
<div class="product-card__data">
<p class="product-card__price">
£39.95
</p>
<p class="product-card__unit-price">
(£57.07 per litre)
</p>
</div>
</a>
</li>
'''
from bs4 import BeautifulSoup
soup=BeautifulSoup(html, "html.parser")
#print(soup.prettify())
print(soup.find('img', {'class':'product-card__image'}).get('src'))
Output:
https://img.thewhiskyexchange.com/480/rum_bum4.jpg
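Since the goal is a spreadsheet of names and image URLs, the same lookup can be applied to every product card on the page. The following is only a sketch under a few assumptions: the function name extract_rows is made up, the first card is a shortened copy of the markup from the question, the second card is a hypothetical one with no image (included to show the guard against a missing tag), and the CSV is written to an in-memory buffer rather than a real file.

```python
import csv
import io

from bs4 import BeautifulSoup

# First <li> is shortened from the question; the second is a made-up card
# without an <img>, used only to exercise the missing-image case.
html = '''
<ul>
<li class="product-grid__item"><a class="product-card" href="/p/63818/bumbu-the-original-rum-glass-pack">
<div class="product-card__image-container"><img src="https://img.thewhiskyexchange.com/480/rum_bum4.jpg" class="product-card__image"></div>
<div class="product-card__content"><p class="product-card__name"> Bumbu The Original Rum<span class="product-card__name-secondary">Glass Pack</span></p></div>
</a></li>
<li class="product-grid__item"><a class="product-card" href="#">
<div class="product-card__content"><p class="product-card__name"> Hypothetical Product</p></div>
</a></li>
</ul>
'''


def extract_rows(markup):
    """Return a list of (name, image_url) tuples, one per product card."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for card in soup.find_all("li", {"class": "product-grid__item"}):
        name_tag = card.find("p", {"class": "product-card__name"})
        img_tag = card.find("img", {"class": "product-card__image"})
        # find() returns None when nothing matches, so guard before reading.
        name = name_tag.get_text(" ", strip=True) if name_tag else ""
        src = img_tag.get("src", "") if img_tag else ""
        rows.append((name, src))
    return rows


rows = extract_rows(html)

# Write the rows as CSV into an in-memory buffer; swap in open(...) for a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "image_url"])
writer.writerows(rows)
print(buffer.getvalue())
```

Passing a separator and strip=True to get_text() joins the main name and the secondary span with a single space, so the two parts do not run together.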