Why is my parsed image link coming out in base64 format

Question

I am trying to parse an image link from a website. When I inspect the link on the website, it is this one: https://static.nike.com/a/images/c_limit,w_592,f_auto/t_product_v1/df7c2668-f714-4ced-9f8f-1f0024f945a9/chaussure-de-basketball-zoom-freak-3-MZpJZF.png but when I parse it with my code, the output is data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7.

from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok').text

soup = BeautifulSoup(source, 'lxml')

pair = soup.find('div', class_='product-card__body')

image_scr = pair.find('img', class_='css-1fxh5tw product-card__hero-image')['src']
print(image_scr)

I don't think the code is the problem, but I don't know what causes the link to come out in base64 format. So how can I set up the code to return the link as a .png?

Answer 1

Since you want to grab the image data that src refers to, use requests to download the data from the server; you need to use .content, as follows:

source = requests.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok').content

Answer 2

What happens?

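First, take a look at your soup; that is the truth. Not all of the information a website provides is static. A lot of content is provided dynamically and filled in by the browser, so requests will not get this information that way. What requests receives is the raw HTML, in which the src of the img is still a placeholder data URI.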
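To see what that src value actually holds, the data URI from the question can be decoded with the standard library. This is only a small sketch: it reads the GIF signature and the logical screen size from the first ten bytes, which is enough to show that the payload is a tiny stand-in image rather than the product photo.

```python
import base64
import struct

# The src value printed by the question's code (without the sentence's trailing period).
data_uri = "data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"

# A base64 data URI has the form "data:<mime type>;base64,<payload>".
header, payload = data_uri.split(",", 1)
raw = base64.b64decode(payload)

signature = raw[:6]                              # GIF files start with b"GIF87a" or b"GIF89a"
width, height = struct.unpack("<HH", raw[6:10])  # logical screen size, little-endian

print(header)
print(signature, width, height)
```

The decoded image is a 1x1 GIF, so there is no way to "convert" it into the .png link; the real URL has to come from somewhere else in the page.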

How to fix?

Take a look at the <noscript> next to your selection; it holds a smaller version of the image and provides the src.
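The idea can be tried offline on a small HTML fragment. The markup below is made up for illustration (the example.com URL and the simplified class names are assumptions, not Nike's actual page), and first_real_src is a hypothetical helper: it walks every img inside a product card, including the one nested in <noscript>, and returns the first src that is not a data URI.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating a lazy-loaded product card:
# the visible <img> carries a placeholder, the <noscript> copy carries the real URL.
html = """
<div class="product-card__body">
  <img class="product-card__hero-image"
       src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7">
  <noscript>
    <img class="product-card__hero-image" src="https://static.example.com/w_318/shoe.png">
  </noscript>
</div>
"""

def first_real_src(card):
    """Return the first img src in card that is not an inline data URI, or None."""
    for img in card.find_all("img"):
        src = img.get("src", "")
        if src and not src.startswith("data:"):
            return src
    return None

soup = BeautifulSoup(html, "html.parser")
card = soup.find("div", class_="product-card__body")
real_src = first_real_src(card)
print(real_src)
```

Skipping data URIs rather than hard-coding a selector keeps the helper working whether the placeholder or the <noscript> copy comes first in the markup.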

Example

from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok').content

soup = BeautifulSoup(source, 'lxml')

pair = soup.find('div', class_='product-card__body')

image_scr = pair.select_one('noscript img.css-1fxh5tw.product-card__hero-image')['src']
print(image_scr)

Output

https://static.nike.com/a/images/c_limit,w_318,f_auto/t_product_v1/df7c2668-f714-4ced-9f8f-1f0024f945a9/chaussure-de-basketball-zoom-freak-3-MZpJZF.png

If you prefer a "big picture", just replace the parameter w_318 with w_1000...
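That replacement can be done in code instead of by hand. The helper below is a sketch, and with_width is a hypothetical name; it assumes the width always appears as a w_<number> segment delimited by commas or slashes, as in the URL above. Whether the server honours every width value is not something this snippet checks.

```python
import re

def with_width(url: str, width: int) -> str:
    """Return url with its w_<number> segment replaced by w_<width>."""
    # The lookarounds keep the match to a whole segment between "," or "/".
    return re.sub(r"(?<=[/,])w_\d+(?=[,/])", f"w_{width}", url, count=1)

small = ("https://static.nike.com/a/images/c_limit,w_318,f_auto/t_product_v1/"
         "df7c2668-f714-4ced-9f8f-1f0024f945a9/chaussure-de-basketball-zoom-freak-3-MZpJZF.png")
large = with_width(small, 1000)
print(large)
```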

EDIT

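Regarding your comment: there are plenty of other solutions, but which one fits still depends on how you want to process this information and what you will be working with.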

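The following approach uses selenium, which, unlike requests, renders the website and gives you back the "right page source", but it also needs more resources than requests: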

from bs4 import BeautifulSoup
from selenium import webdriver

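# Raw string so the backslashes in the Windows path are not read as escape sequences.
# Note: newer Selenium releases expect the driver path via a Service object instead of a positional argument.
driver = webdriver.Chrome(r'C:\Program Files\ChromeDriver\chromedriver.exe')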
driver.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok')

soup=BeautifulSoup(driver.page_source, 'html.parser')

pair = soup.find('div', class_='product-card__body')

image_scr = pair.select_one('img.css-1fxh5tw.product-card__hero-image')['src']
print(image_scr)

Output

https://static.nike.com/a/images/c_limit,w_592,f_auto/t_product_v1/df7c2668-f714-4ced-9f8f-1f0024f945a9/chaussure-de-basketball-zoom-freak-3-MZpJZF.png
