Downloading images with selenium and requests: why does the .get_attribute() method of a WebElement return a URL in base64?

Question

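I wrote a web scraping program that goes to an online marketplace like www.tutti.ch, searches for a category keyword, and then downloads all the photos from the search results into a folder.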

#! python3
# imageSiteDownloader_stack.py - A program that goes to an online marketplace like
# tutti.ch, searches for a category of photos, and then downloads all the
# resulting images.

import requests, os
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox() # Opens Firefox webbrowser
browser.get('https://www.tutti.ch/') # Go to tutti.ch website
wait = WebDriverWait(browser, 10)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-accept-btn-handler"))).click() # accepts cookies terms
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "._1CFCt > input:nth-child(1)"))).send_keys('Gartenstuhl') # enters search key word
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id*='1-val-searchLabel']"))).click() # clicks submit button
os.makedirs('tuttiBilder', exist_ok=True) # creates new folder
images = browser.find_elements(By.TAG_NAME, 'img') # stores every img element in a list
for im in images:
    imageURL = im.get_attribute('src') # get the URL of the image
    print('Downloading image %s...' % (imageURL))
    res = requests.get(imageURL) # downloads the image
    res.raise_for_status()
    imageFile = open(os.path.join('tuttiBilder', os.path.basename(imageURL)), 'wb') # creates an image file
    for chunk in res.iter_content(100000): # writes to the image file
        imageFile.write(chunk)
    imageFile.close()
print('Done.')
browser.quit()

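My program crashes at line 26 with the following exception:

The program downloads the first few photos correctly, but then suddenly crashes.

Looking for a solution on Stack Overflow, I found this post:

The answer to the post above suggests that the problem is caused by a newline character in the URL.

I checked the source URLs in the HTML for the photos that could not be downloaded. They seem fine.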

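The problem seems to be that the browser.find_elements() method parses the 'src' attribute value incorrectly, or that the .get_attribute() method fails to get some of the URLs correctly. Instead of getting something like

https://c.tutti.ch/images/23452346536.jpg

the method returns a string like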

data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7

Of course, this is not a valid URL that requests.get() can use to download an image. I did some research and found that this might be a base64 string...

Why does the .get_attribute() method return a base64 string in some cases? Can I prevent it from doing so? Or do I have to convert it to a normal string?

Update: another approach that parses the page with BeautifulSoup instead of WebDriver. (This code does not work properly yet. The data URLs are still a problem.)

import requests, sys, os, bs4
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox() # Opens Firefox webbrowser
browser.get('https://www.tutti.ch/') # Go to tutti.ch website
wait = WebDriverWait(browser, 10)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-accept-btn-handler"))).click()
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "._1CFCt > input:nth-child(1)"))).send_keys(sys.argv[1:])
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id*='1-val-searchLabel']"))).click() # https://www.tutorialspoint.com/how-to-locate-element-by-partial-id-match-in-selenium
os.makedirs('tuttiBilder', exist_ok=True)

url = browser.current_url
print('Downloading page %s...' % url)
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

#Check for errors from here
images = soup.select('div[style] > img')

for im in images:
    imageURL = im.get('src') # get the URL of the image
    print('Downloading image %s...' % (imageURL))
    res = requests.get(imageURL) # downloads the image
    res.raise_for_status()
    imageFile = open(os.path.join('tuttiBilder', os.path.basename(imageURL)), 'wb') # creates an image file
    for chunk in res.iter_content(100000): # writes to the image file
        imageFile.write(chunk)
    imageFile.close()
print('Done.')
browser.quit()

Answer 1

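May I suggest not using Selenium: there is a backend API that serves the data for each page. The only tricky part is that requests to the API need a specific UUID hash, which is located in the HTML of the landing page. So you can fetch it when you load the landing page and then use it to sign your subsequent API calls. Here is an example that loops through the pages and the images of each post: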

import requests
import re
import os

search = 'Gartenstuhl'

headers =   {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
url = f'https://www.tutti.ch/de/li/ganze-schweiz?q={search}'
step = requests.get(url,headers=headers)
print(step)

uuids = re.findall( r'[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}',step.text)
print(f'tutti hash code: {uuids[0]}') #used to sign requests to api

os.makedirs('tuttiBilder', exist_ok=True)
for page in range(1,10):
    api = f'https://www.tutti.ch/api/v10/list.json?aggregated={page}&limit=30&o=1&q={search}&with_all_regions=true'

    new_headers = {
        'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
        'x-tutti-hash':uuids[0],
        'x-tutti-source':'web latest-staging'
        }

    resp = requests.get(api,headers=new_headers).json()

    for item in resp['items']:
        for image in item['image_names']:

            image_url = 'https://c.tutti.ch/images/'+image
            pic = requests.get(image_url)

            with open(os.path.join('tuttiBilder', os.path.basename(image)),'wb') as f:
                f.write(pic.content)
            print(f'Saved: {image}')
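
To see what the re.findall() call in the example above picks up, here is a small offline check. The HTML fragment and the UUID value in it are made up for illustration, and the helper name first_uuid is hypothetical; where the real landing page embeds the value is not shown here.

```python
import re

# Same pattern as in the answer: 8-4-4-4-12 hexadecimal groups.
UUID_RE = re.compile(
    r'[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}'
)

def first_uuid(html):
    """Return the first UUID-shaped string in html, or None if there is none."""
    match = UUID_RE.search(html)
    return match.group(0) if match else None

# Hypothetical landing-page fragment; the real markup may look different.
sample_html = '<script>window.cfg = {"id": "123e4567-e89b-12d3-a456-426614174000"}</script>'

print(first_uuid(sample_html))
print(first_uuid('<p>no identifier here</p>'))
```

Returning None instead of indexing uuids[0] directly makes it possible to stop with a clear message if the page layout changes and no UUID is found.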

Answer 2

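The program crashes because you are trying to download a file (an image) using a base64-encoded string, which is not a valid image URL. These base64 strings appear in your list of images because every image (in an <img> tag) initially appears to be a base64 string, and once it has loaded, the src value changes to a valid image URL. You can check this by opening DevTools in your browser while visiting the site at https://...ganze-schweiz?q=Gartenstuhl and searching for "base64" in the Elements section of DevTools; as you move to the next search hit with the arrow buttons, you will notice the behaviour described above. This is also why (as shown in your cmd window snippet, and as I confirmed by testing it myself) only 3 to 5 images are found and downloaded. Those images are the ones at the top of the page, which load successfully when the page is opened and are given a valid image URL, while the remaining <img> tags still contain a base64 string.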

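So the first step is, once the search has completed, to scroll down the page slowly so that every image on the page loads and is given a valid URL. The scroll_down_page() function in the code below does this. You can adjust the speed as you like, as long as it allows the items/images to load properly.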
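
A bounded variant of this scrolling step can be sketched as follows. It assumes only that the driver object exposes Selenium's execute_script() method; the function name scroll_to_bottom and the step, pause and max_steps parameters are made up for this sketch, and their default values are guesses that would need tuning against the real page.

```python
import time

def scroll_to_bottom(driver, step=400, pause=0.2, max_steps=500):
    """Scroll down in increments so lazy-loaded images get a chance to load.

    driver is any object with Selenium's execute_script(script) method.
    Returns the number of scroll steps performed.
    """
    position = 0
    steps = 0
    height = driver.execute_script("return document.body.scrollHeight")
    while position < height and steps < max_steps:
        position += step
        driver.execute_script("window.scrollTo(0, {});".format(position))
        time.sleep(pause)  # give the page time to swap in real image URLs
        # The page may grow while content loads, so re-read the height each time.
        height = driver.execute_script("return document.body.scrollHeight")
        steps += 1
    return steps
```

With a real browser this would be called as scroll_to_bottom(browser) right after the search results appear. The max_steps cap is there so that a page that keeps growing cannot keep the loop running forever.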

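The second step is to make sure that only valid URLs are passed to requests.get(). Although every base64 string will be replaced by a valid URL thanks to the fix above, the list may still contain invalid image URLs; in fact, there seems to be one (unrelated to the items) starting with https://bat.bing.com/action/0?t.... It is therefore prudent to check that each requested URL is a valid image URL before trying to download it. You can use the str.endswith() method to look for strings ending with specific suffixes (extensions), such as ".png" and ".jpg". If a string in the image list does end with one of these extensions, you can go ahead and download the image. A working example is given below. (Note that it downloads the images appearing on the first page of the search results. If you need more results, you can extend the program to navigate to the next page and repeat the same steps.)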
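
As a small illustration of the suffix check, here is a sketch of a slightly stricter filter. It is not part of the working example: the helper name is_image_url, the extra ".jpeg" suffix, the scheme check and the example.com URLs are all assumptions added here. Looking only at the path part of the URL means a trailing query string does not hide the extension.

```python
from urllib.parse import urlparse

SUFFIXES = (".png", ".jpg", ".jpeg")  # ".jpeg" is an addition for this sketch

def is_image_url(src):
    """True if src is an http(s) URL whose path ends with an image suffix."""
    if not src:  # None or empty string, e.g. an <img> without a src
        return False
    parts = urlparse(src)
    if parts.scheme not in ("http", "https"):
        return False  # rejects data: URIs and any other scheme
    return parts.path.lower().endswith(SUFFIXES)

candidates = [
    "https://c.tutti.ch/images/23452346536.jpg",
    "data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7",
    "https://example.com/photo.JPG?size=large",  # hypothetical URL with a query string
    "https://example.com/track?id=1",            # hypothetical non-image URL
    None,
]
for src in candidates:
    print(is_image_url(src), src)
```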

Update 1

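The code below has been updated so that more results can be obtained by navigating to the following pages and downloading their images. Set the next_pages_no variable to the number of "next pages" you want results from.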

import requests, os
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

suffixes = (".png", ".jpg")
next_pages_no = 3
browser = webdriver.Firefox() # Opens Firefox webbrowser
#browser = webdriver.Chrome() # Opens Chrome webbrowser
wait = WebDriverWait(browser, 10)
os.makedirs('tuttiBilder', exist_ok=True) 

def scroll_down_page(speed=40):
    current_scroll_position, new_height= 0, 1
    while current_scroll_position <= new_height:
        current_scroll_position += speed
        browser.execute_script("window.scrollTo(0, {});".format(current_scroll_position))
        new_height = browser.execute_script("return document.body.scrollHeight")

def save_images(images):
    for im in images:
        imageURL = im.get_attribute('src') # gets the URL of the image
        if imageURL and imageURL.endswith(suffixes): # skips <img> elements without a src
            print('Downloading image %s...' % (imageURL))
            res = requests.get(imageURL, stream=True) # downloads the image
            res.raise_for_status()
            imageFile = open(os.path.join('tuttiBilder', os.path.basename(imageURL)), 'wb') # creates an image file
            for chunk in res.iter_content(1024): # writes to the image file
                imageFile.write(chunk)
            imageFile.close()
  
def get_first_page_results():
    browser.get('https://www.tutti.ch/')
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-accept-btn-handler"))).click() # accepts cookies terms
    wait.until(EC.presence_of_element_located((By.XPATH, '//form//*[name()="input"][@data-automation="li-text-input-search"]'))).send_keys('Gartenstuhl') # enters search keyword
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id*='1-val-searchLabel']"))).click() # clicks submit button
    scroll_down_page()  # scroll down the page slowly for the images to load
    images = browser.find_elements(By.TAG_NAME, 'img')  # stores every img element in a list
    save_images(images)

def get_next_page_results():
    wait.until(EC.visibility_of_element_located((By.XPATH, '//button//*[name()="svg"][@data-testid="NavigateNextIcon"]'))).click()
    scroll_down_page()  # scroll down the page slowly for the images to load
    images = browser.find_elements(By.TAG_NAME, 'img')  # stores every img element in a list
    save_images(images)
    

get_first_page_results()

for _ in range(next_pages_no):
    get_next_page_results()
    
print('Done.')
browser.quit()

Update 2

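As requested, here is another approach to the problem, using Python requests to download the HTML content of a given URL and the BeautifulSoup library to parse that content and get the image URLs. In the HTML content, both base64 strings and actual image URLs are present (the base64 strings occur exactly as many times as the image URLs). You can therefore use the same approach as above and check their suffixes before downloading them. Below is the complete working example (adjust the page range in the for loop as needed).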

import requests
from bs4 import BeautifulSoup as bs
import os

suffixes = (".png", ".jpg")
os.makedirs('tuttiBilder', exist_ok=True) 


def save_images(imageURLS):
    for imageURL in imageURLS:
        if imageURL.endswith(suffixes):
            print('Downloading image %s...' % (imageURL))
            res = requests.get(imageURL, stream=True) # downloads the image
            res.raise_for_status()
            imageFile = open(os.path.join('tuttiBilder', os.path.basename(imageURL)), 'wb') # creates an image file
            for chunk in res.iter_content(1024): # writes to the image file
                imageFile.write(chunk)
            imageFile.close()
 
def get_results(page_no, search_term):
    response = requests.get('https://www.tutti.ch/de/li/ganze-schweiz?o=' + str(page_no) + '&q=' + search_term) 
    soup = bs(response.content, 'html.parser')
    images = soup.find_all("img")
    imageURLS = [image['src'] for image in images if image.has_attr('src')] # skips <img> tags without a src
    save_images(imageURLS)


for i in range(1, 4): # get results from page 1 to page 3
    get_results(i, "Gartenstuhl")
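
To try the placeholder-versus-URL split without a network connection, here is a minimal sketch that swaps BeautifulSoup for the standard library's html.parser. The HTML fragment is made up to imitate the situation described above and is not copied from the real page; the class name ImgSrcCollector is hypothetical.

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

# Made-up fragment: one placeholder, one real image URL, one <img> without src.
sample_html = """
<div><img src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"></div>
<div><img src="https://c.tutti.ch/images/23452346536.jpg"></div>
<div><img alt="no source"></div>
"""

parser = ImgSrcCollector()
parser.feed(sample_html)

placeholders = [s for s in parser.sources if s.startswith("data:")]
downloadable = [s for s in parser.sources if s.endswith((".png", ".jpg"))]
print(len(placeholders), "placeholder(s);", downloadable)
```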

Update 3

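To clarify, the base64 strings are all the same, namely R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7. You can check this by saving the received HTML content to a file (to do so, add the code below to the get_results() function of the second solution), opening it in a text editor and searching for "base64".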

with open("page.html", 'wb') as f:
    f.write(response.content)

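If you paste the base64 string above into an online "base64-to-image" converter, then download the image and open it in a graphics editor (such as Paint), you will see that it is a 1px image (commonly called a "tracking pixel"). Such a "tracking pixel" is used in web beacon techniques to check whether a user has accessed some content; in your case, a product in the list.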
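
The size of the image can also be checked locally, without an online converter. This sketch decodes the string quoted above and reads the GIF header with the standard library only; the variable names are arbitrary.

```python
import base64
import struct

data_uri = "data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"

# Everything after the first comma is the base64 payload.
payload = data_uri.split(",", 1)[1]
raw = base64.b64decode(payload)

# A GIF file starts with a 6-byte signature, followed by the logical screen
# width and height as little-endian unsigned 16-bit integers.
signature = raw[:6]
width, height = struct.unpack("<HH", raw[6:10])
print(signature, width, height)
```

The signature comes out as GIF89a and both dimensions as 1, so the string is a 1x1 GIF image.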

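The base64 string is not an invalid URL that somehow turns into a valid one. It is an encoded image string, which can be decoded to recover the image. Thus, in the first solution using Selenium, those base64 strings are not converted into valid image URLs as you scroll down the page. Instead, they tell the website that you have accessed some content, and the website then removes/hides them; that is why they do not show up in the results. The images (and hence the image URLs) appear as soon as you scroll down to a product, because a common technique called "image lazy loading" is used (to improve performance, user experience, etc.). Lazy loading instructs the browser to defer loading off-screen images until the user scrolls near them.

In the second solution, since requests.get() is used to retrieve the HTML content, the base64 strings are still in the HTML document, one per product. Again, those base64 strings are all the same and are 1px images used for the purpose mentioned earlier. You therefore do not need them in your results, and they should be ignored. Both solutions above download all the product images present on the web page. You can check this by looking at the tuttiBilder folder after running the programs.

If, however, you still want to save those base64 images (which is pointless, as they are all the same and not useful), replace the save_images() function in the second solution (i.e., the one using BeautifulSoup) with the one below. Make sure to import the additional libraries (as shown below). The code below saves all the base64 images, along with the product images, in the same tuttiBilder folder and assigns them unique identifiers as filenames (as they do not come with a filename).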

import re
import base64
import uuid

def save_images(imageURLS):
    for imageURL in imageURLS:
        if imageURL.endswith(suffixes):
            print('Downloading image %s...' % (imageURL))
            res = requests.get(imageURL, stream=True) # downloads the image
            res.raise_for_status()
            imageFile = open(os.path.join('tuttiBilder', os.path.basename(imageURL)), 'wb') # creates an image file
            for chunk in res.iter_content(1024): # writes to the image file
                imageFile.write(chunk)
            imageFile.close()
        elif imageURL.startswith("data:image/"):
            base64string = re.sub(r"^.*?/.*?,", "", imageURL)
            image_as_bytes = str.encode(base64string)  # convert string to bytes
            recovered_img = base64.b64decode(image_as_bytes)  # decode base64string
            filename = os.path.join('tuttiBilder', str(uuid.uuid4()) + ".png")
            with open(filename, "wb") as f:
                f.write(recovered_img)

Answer 3

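That is not an ordinary URL pointing to a hosted file. The actual image data is stored in the string itself, base64-encoded. Try pasting it into your browser's address bar (starting from the data: part) and you will see the image.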

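What is happening is that the image is not hosted at a separate URL but embedded in the page, and your browser simply decodes that data to render the image. If you want the raw image data, base64-decode everything after the ;base64, part.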
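
A minimal sketch of that decoding step, using only the standard library. The function name decode_data_uri and the small EXTENSIONS table are made up for this example, and only base64-encoded data URIs are handled.

```python
import base64

# Hypothetical lookup table for choosing a file extension from the MIME type.
EXTENSIONS = {"image/gif": ".gif", "image/png": ".png", "image/jpeg": ".jpg"}

def decode_data_uri(uri):
    """Split a base64 data URI into (mime_type, raw_bytes).

    Raises ValueError for anything that is not a base64-encoded data URI.
    """
    if not uri.startswith("data:"):
        raise ValueError("not a data URI")
    header, sep, payload = uri[len("data:"):].partition(",")
    if not sep or not header.endswith(";base64"):
        raise ValueError("only base64-encoded data URIs are handled here")
    mime_type = header.split(";")[0]
    return mime_type, base64.b64decode(payload)

uri = "data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
mime_type, raw = decode_data_uri(uri)
extension = EXTENSIONS.get(mime_type, ".bin")
print(mime_type, extension, len(raw), "bytes")
```

The bytes in raw can then be written to a file opened in 'wb' mode. Taking the extension from the MIME type means this particular string would be saved as a .gif file.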


