Image source is different in HTML between my browser and a GET request

Question

I suspect this happens because I misunderstand how lxml or HTML works, and I'd appreciate it if someone could fill in this gap in my knowledge.

My code is:

import requests
from lxml import html

url = "https://prnt.sc/ca0000"
response = requests.get(url, headers={'User-Agent': 'Chrome'})

# Navigate to the correct img src.
tree = html.fromstring(response.content)
xpath = '/html/body/div[3]/div/div/img/@src'

imageURL = tree.xpath(xpath)[0]

print(imageURL)

I expected to get a result like this:

data:image/png;base64,iVBORw0KGgoAAA...((THIS IS REALLY LONG))...Jggg==

If I understand correctly, this is where the image is stored locally on my computer.

However, when I run the code I get:

"https://prnt.sc/ca0000"

Why are these different?

Answer 1

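The problem is that this page uses JavaScript to replace https://prnt.sc/ca0000 with data:image/png;base64 ..., but requests can't run JavaScript, so you only see the src from the raw HTML.

However, the page has two img elements with different src values: the first has a standard image URL (https://...), and the other has the fake https://prnt.sc/ca0000.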

So this XPath works for me even without JavaScript:

xpath = '//img[@id="screenshot-image"]/@src'
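
To illustrate why selecting by id is sturdier than a positional path, here is a minimal sketch run against made-up markup rather than the live page. The SAMPLE_PAGE string, the example.com URLs, and the first_src helper are all assumptions for illustration; the real page is laid out differently.

```python
from lxml import html

# Hypothetical markup: two <img> elements, only one carrying the id
# that the XPath above looks for. The real page is laid out differently.
SAMPLE_PAGE = """
<html>
  <body>
    <div><img id="screenshot-image" src="https://example.com/real-image.png"></div>
    <div><img src="https://example.com/page"></div>
  </body>
</html>
"""


def first_src(markup, xpath):
    """Return the first value matched by an XPath expression, or None."""
    matches = html.fromstring(markup).xpath(xpath)
    return str(matches[0]) if matches else None


# Positional path: tied to the exact nesting, so it can land on the wrong <img>.
positional = first_src(SAMPLE_PAGE, '/html/body/div[2]/img/@src')

# Attribute-based path: finds the element wherever it sits in the tree.
by_id = first_src(SAMPLE_PAGE, '//img[@id="screenshot-image"]/@src')

print(positional)
print(by_id)
```

In this sample the positional path picks up the placeholder image, while the id-based path still finds the intended one even if the surrounding div structure changes.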

This code gets the correct URL and downloads the image.

import requests
from lxml import html

url = "https://prnt.sc/ca0000"

response = requests.get(url, headers={'User-Agent': 'Chrome'})

tree = html.fromstring(response.content)

image_url = tree.xpath('//img[@id="screenshot-image"]/@src')[0]

print(image_url)

# -- download ---

response = requests.get(image_url, headers={'User-Agent': 'Chrome'})

with open('image.png', 'wb') as fh:
    fh.write(response.content)

Result:

https://image.prntscr.com/image/797501c08d0a46ae93ff3a477b4f771c.png
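
As a side note on the data:image/png;base64 value from the question: a data: URI is not a location on disk. It carries the image bytes inline, base64-encoded after the comma. The following is a minimal sketch of decoding one; the decode_data_uri helper and the tiny sample payload (just the PNG file signature) are assumptions for illustration, not something taken from the page.

```python
import base64


def decode_data_uri(uri):
    """Split a base64 data: URI into (media type, raw bytes)."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data: URI")
    media_type = header[len("data:"):-len(";base64")]
    return media_type, base64.b64decode(payload)


# Hypothetical sample: only the eight-byte PNG signature, not a full image.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
sample_uri = "data:image/png;base64," + base64.b64encode(PNG_SIGNATURE).decode("ascii")

media_type, raw = decode_data_uri(sample_uri)
print(media_type, len(raw))
```

If a browser-driven tool ever hands you such a string, the decoded bytes could be written to a file opened in 'wb' mode, just like response.content in the download step above.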

Tags: html, python, xpath, lxml