Scrape images from a particular e-commerce website's link
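I am scraping an e-commerce website to gain experience, and I am currently stuck on scraping the product images.
I have scraped the HTML of all existing images of a product, but I cannot extract the links from that HTML.
The code I tried is: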
import requests
from bs4 import BeautifulSoup
import pandas as pd
baseurl='https://www.preispirat24.com/neu-im-september/'
baseforimages='https://www.preispirat24.com/'
headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
productlinks=[]
for x in range(0,1,1):
    r=requests.get('https://www.preispirat24.com/neu-im-september/?page={}'.format(x))
    soup=BeautifulSoup(r.content, 'html.parser')
    productlist=soup.find_all('div',class_='title-description')
    item='title-description'
    for item in productlist:
        for link in item.find_all('a',href=True):
            productlinks.append(link['href'])
            a=(link['href'])
#testlink='https://www.preispirat24.com/Lufterfrischer/axe-air-fresher/axe-mini-vent-dark-temptation-air-freshener-lufterfrischer-6er-t-dsp.html'
insultlist=[]
images=[]
for link in productlinks:
    b=link
    try:
        r=requests.get(link,headers=headers)
        soup=BeautifulSoup(r.content, 'html.parser')
        title=soup.find('h1',class_="product-info-title-desktop hidden-xs hidden-sm").text.strip()
        description=soup.find(class_='tab-body active',itemprop="description").text.strip()
        itemnumber=soup.find('span',itemprop="model").text.strip()
        images=soup.find_all(class_='align-vertical')
        print(images)
        #print (images['src'])
    except:
        print('----')
    insult={
        'title':title,
        'description':description,
        'itemnumber':itemnumber,
        'images':images,
        'productlink':b
    }
    insultlist.append(insult)
    df=pd.DataFrame(insultlist)
    print('Saving :',title)
    print(df.head)
    df.to_csv('3veerapreispirat24.csv')
The output I get looks like this:
<img alt="Mobile Preview: 99671" data-magnifier-src="images/product_images/original_images/99671(1).jpg" src="images/product_images/gallery_images/99671(1).jpg" title="Mobile Preview: 99671"/>
</div>, <div class="align-vertical">
<img alt="Mobile Preview: 99671" data-magnifier-src="images/product_images/original_images/99671.jpg" src="images/product_images/gallery_images/99671.jpg" title="Mobile Preview: 99671"/>
</div>]
The output I want:
images/product_images/original_images/99671(1).jpg
images/product_images/gallery_images/99671(1).jpg
images/product_images/original_images/99671.jpg
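images/product_images/gallery_images/99671.jpg
Note that I tried print(images['src']).
It raised an exception, so only ---- was printed.
Sample product link from which the product images should be extracted
Thanks in advance for your help.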
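As far as I can tell, your images variable is a list of HTML <div> elements, which is why indexing it with 'src' fails. You should iterate over each item in the list, find its <img>, and then get its src attribute, for example:
for element in images:
    url = element.find("img").get("src")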
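To make the loop above concrete, here is a minimal sketch that runs on markup modeled on the output shown in the question, with no network access. The opening <div> of the first element is an assumption (the pasted output starts mid-list), and the helper name extract_image_paths is made up for illustration. It collects both data-magnifier-src and src, in the order listed under the desired output.

```python
from bs4 import BeautifulSoup

# Markup modeled on the question's output; the first opening <div> is assumed.
html = """
<div class="align-vertical">
<img alt="Mobile Preview: 99671" data-magnifier-src="images/product_images/original_images/99671(1).jpg" src="images/product_images/gallery_images/99671(1).jpg" title="Mobile Preview: 99671"/>
</div>
<div class="align-vertical">
<img alt="Mobile Preview: 99671" data-magnifier-src="images/product_images/original_images/99671.jpg" src="images/product_images/gallery_images/99671.jpg" title="Mobile Preview: 99671"/>
</div>
"""


def extract_image_paths(soup):
    """Return the image paths found inside every 'align-vertical' element.

    Hypothetical helper: for each <img>, the full-size path
    (data-magnifier-src) is taken first, then the gallery path (src).
    """
    paths = []
    for div in soup.find_all(class_="align-vertical"):
        img = div.find("img")
        if img is None:
            # Skip wrappers that contain no image instead of raising.
            continue
        for attr in ("data-magnifier-src", "src"):
            value = img.get(attr)  # .get() returns None if the attribute is missing
            if value:
                paths.append(value)
    return paths


soup = BeautifulSoup(html, "html.parser")
image_paths = extract_image_paths(soup)
for path in image_paths:
    print(path)
```

Using .get() rather than img['src'] means a missing attribute yields None instead of a KeyError, so one odd product page does not abort the whole run.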
To get the image URLs from a product link, you can use this example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.preispirat24.com/Verbrauchsartikel/Hygiene-Artikel-127/mund-nasen-maske-3-lagig-pink-mit-nasenbuegel-ohrschlaufen-einheitsgroesse-10-stuec.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for img in soup.select('#product_thumbnail_swiper [data-magnifier-src]'):
    print('https://www.preispirat24.com/' + img['data-magnifier-src'])
Prints:
https://www.preispirat24.com/images/product_images/original_images/99649mix.jpg
https://www.preispirat24.com/images/product_images/original_images/99649.jpg
https://www.preispirat24.com/images/product_images/original_images/99649_0.jpg
https://www.preispirat24.com/images/product_images/original_images/99649_1.jpg
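The example above builds absolute URLs by string concatenation, which works as long as every path is relative and has no leading slash. As a possible alternative, the sketch below uses urljoin from the standard library, which also copes with a leading slash or an already absolute URL. BASE_URL and to_absolute are names chosen for illustration, and the cdn.example.com address is a made-up placeholder.

```python
from urllib.parse import urljoin

# Base URL of the shop, as used in the answer's string concatenation.
BASE_URL = "https://www.preispirat24.com/"


def to_absolute(path, base=BASE_URL):
    """Resolve a relative image path against the base URL (hypothetical helper)."""
    return urljoin(base, path)


relative_paths = [
    "images/product_images/original_images/99649mix.jpg",
    "/images/product_images/original_images/99649.jpg",  # leading slash
    "https://cdn.example.com/placeholder.jpg",           # already absolute (made up)
]

absolute_urls = [to_absolute(p) for p in relative_paths]
for url in absolute_urls:
    print(url)
```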
EDIT: To save the product to a CSV file, you can do:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.preispirat24.com/Verbrauchsartikel/Hygiene-Artikel-127/mund-nasen-maske-3-lagig-pink-mit-nasenbuegel-ohrschlaufen-einheitsgroesse-10-stuec.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
title=soup.find('h1',class_="product-info-title-desktop hidden-xs hidden-sm").text.strip()
description=soup.find(class_='tab-body active',itemprop="description").text.strip()
itemnumber=soup.find('span',itemprop="model").text.strip()
images = []
for img in soup.select('#product_thumbnail_swiper [data-magnifier-src]'):
    images.append('https://www.preispirat24.com/' + img['data-magnifier-src'])
    # print('https://www.preispirat24.com/' + img['data-magnifier-src'])
df = pd.DataFrame({
    'title':title,
    'description':description,
    'itemnumber':itemnumber,
    'images':[images],
    'productlink':url
})
df.to_csv('data.csv')
print(df)
Prints:
title ... productlink
0 Mund Nasen Maske 3-lagig PINK mit Nasenbügel, ... ... https://www.preispirat24.com/Verbrauchsartikel...
[1 rows x 5 columns]
and saves data.csv.
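The example above handles a single product. For a loop over many product links, one possible pattern is to collect one dict per product and build the DataFrame once after the loop, rather than rebuilding it and rewriting the CSV on every iteration. The sketch below shows only that pattern: the records in scraped are hypothetical placeholders standing in for real scraped pages, the " | " separator is an arbitrary choice, and the CSV is written to an in-memory buffer so the example runs without touching disk.

```python
import io

import pandas as pd

# Placeholder records standing in for scraped product pages (hypothetical values).
scraped = [
    {
        "title": "Example product A",
        "itemnumber": "A-1",
        "images": ["https://example.com/a1.jpg", "https://example.com/a2.jpg"],
        "productlink": "https://example.com/a.html",
    },
    {
        "title": "Example product B",
        "itemnumber": "B-1",
        "images": [],  # a product without images
        "productlink": "https://example.com/b.html",
    },
]

rows = []
for product in scraped:
    rows.append({
        "title": product["title"],
        "itemnumber": product["itemnumber"],
        # Join the list into one string so the CSV cell stays easy to split later.
        "images": " | ".join(product["images"]),
        "productlink": product["productlink"],
    })

# Build the frame once, after all rows have been collected.
df = pd.DataFrame(rows)

buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, pass a filename such as 'data.csv' to to_csv in place of the buffer.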