Python web scraping script does not find element by XPath even though it exists

Question

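I'm currently writing a small script that, given a link to a price comparison site for my country, should extract the name, link, price, and picture of the cheapest product.

An example link looks like this: https://geizhals.at/?cat=monlcd19wide&xf=11939_23~11955_IPS~11963_144~14591_19201080&asuch=&bpmin=&bpmax=&v=e&hloc=at&plz=&dist=&mail=&sort=p&bl1_id=30#productlist

This is my code so far: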

#!/usr/bin/env python3
from urllib.request import Request, urlopen
from lxml import html
# Example links; only the last assignment takes effect
link = 'https://geizhals.at/?cat=monlcd19wide&xf=11939_23~11955_IPS~11963_144~14591_19201080&asuch=&bpmin=&bpmax=&v=e&hloc=at&plz=&dist=&mail=&sort=p&bl1_id=30#productlist'
link = 'https://geizhals.at/?cat=monlcd19wide&v=e&hloc=at&sort=p&bl1_id=30&xf=11939_23%7E11955_IPS%7E11963_240%7E14591_19201080'
link = 'https://geizhals.at/?cat=cpuamdam4&xf=25_6%7E5_PCIe+4.0%7E5_SMT%7E820_AM4'

def get_webSite():
    # A browser-like User-Agent header is sent with the request
    req = Request(link, headers={'User-Agent': 'Mozilla/5.0'})
    return urlopen(req).read()


webpage = get_webSite()  # Contains all HTML from the site
root = html.fromstring(webpage)

price = root.xpath("//*[@id=\"product0\"]/div[6]/span/span")[0].text.strip()
name = root.xpath("//*[@id=\"product0\"]/div[2]/a/span")[0].text.strip()
link = "https://geizhals.at/" + root.xpath("//*[@id=\"product0\"]/div[2]/a/@href")[0]
picture = root.xpath("//*[@id=\"product0\"]/div[1]/a/div/picture/img/@big-image-url")[0]
# @ selects an attribute of the current element; / separates the location steps of the path
# xpath() returns a list of matches; [0] takes the first one and raises IndexError if nothing matched

price = price.lstrip('€ ')  # removes the euro sign and the space
price = price.replace(',', '.')  # replaces the decimal comma with a dot
price = float(price)  # converts the price string to a float

print(f"Price : {price}")
print("Name : " + (name))
print("Link : " + (link))
print("PictureLink : " + (picture))
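
The price clean-up and link-building steps in the code above could also be written as small helpers. This is only a sketch: `parse_price`, `absolute_link`, and `BASE_URL` are hypothetical names, the sample price and href are made up, and the handling of `.` as a thousands separator is an assumption about the site's price format.

```python
from urllib.parse import urljoin

BASE_URL = "https://geizhals.at/"  # site root used in the question


def parse_price(text):
    """Convert a price string such as '€ 189,90' to a float.

    Assumes ',' is the decimal mark and '.' is an optional
    thousands separator (an assumption about the site's format).
    """
    cleaned = text.strip().lstrip("€ ").strip()
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return float(cleaned)


def absolute_link(href):
    """Resolve a possibly relative href against the site root."""
    return urljoin(BASE_URL, href)


print(parse_price("€ 189,90"))
print(absolute_link("./example-product.html"))  # hypothetical href
```

Using `urljoin` instead of string concatenation means an href that is already absolute, or that starts with `./`, still resolves to a usable URL.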

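Everything works and is printed to the console except the link to the picture thumbnail. I have tried both the normal XPath and the full XPath, to no avail: no such element is found, even though it exists.

What could be the problem?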

Answer 1

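The part of your XPath to change is:

img/@big-image-url

Use a predicate instead:

img[@big-image-url]

The / in img/@big-image-url starts a new location step from img (here along the attribute axis), whereas the predicate [@big-image-url] tests for the attribute on the img tag itself and returns the element, whose value you can then read from attrib. Here is an example that scrapes all the images from the page: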

import requests
from lxml import html
res = requests.get('https://geizhals.at/?cat=monlcd19wide&xf=11939_23~11955_IPS~11963_144~14591_19201080&asuch=&bpmin=&bpmax=&v=e&hloc=at&plz=&dist=&mail=&sort=p&bl1_id=30#productlist')
root = html.fromstring(res.content)
# Collect the big-image-url value of every img element that has the attribute
print([item.attrib['big-image-url'] for item in root.xpath('//img[@big-image-url]')])
['https://gzhls.at/i/61/20/2436120-n0.jpg', 'https://gzhls.at/i/05/53/2430553-n0.jpg', 'https://gzhls.at/i/75/76/2237576-n0.jpg', 'https://gzhls.at/i/15/28/2201528-n0.jpg', 'https://gzhls.at/i/19/26/2221926-n0.jpg', 'https://gzhls.at/i/06/38/2410638-n0.jpg', 'https://gzhls.at/i/98/04/2459804-n0.jpg', 'https://gzhls.at/i/14/04/2201404-n0.jpg', 'https://gzhls.at/i/24/52/2132452-n0.jpg', 'https://gzhls.at/i/17/64/2401764-n0.jpg', 'https://gzhls.at/i/07/97/2350797-n0.jpg', 'https://gzhls.at/i/50/31/2365031-n0.jpg', 'https://gzhls.at/i/25/01/2322501-n0.jpg', 'https://gzhls.at/i/26/50/2152650-n0.jpg', 'https://gzhls.at/i/27/93/2202793-n0.jpg', 'https://gzhls.at/i/72/69/2267269-n0.jpg', 'https://gzhls.at/i/20/79/2142079-n0.jpg', 'https://gzhls.at/i/06/48/2430648-n0.jpg', 'https://gzhls.at/i/41/24/2294124-n0.jpg', 'https://gzhls.at/i/82/46/2378246-n0.jpg', 'https://gzhls.at/i/46/35/2124635-n0.jpg', 'https://gzhls.at/i/43/84/2304384-n0.jpg', 'https://gzhls.at/i/29/73/2382973-n0.jpg', 'https://gzhls.at/i/07/36/2410736-n0.jpg', 'https://gzhls.at/i/97/54/2459754-n0.jpg', 'https://gzhls.at/i/67/40/2456740-n0.jpg', 'https://gzhls.at/i/15/03/2151503-n0.jpg', 'https://gzhls.at/i/45/26/2244526-n0.jpg', 'https://gzhls.at/i/91/51/2089151-n0.jpg', 'https://gzhls.at/i/39/71/2393971-n0.jpg']

So the image URL should be in the big-image-url attribute of the img tag in the HTML.
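
To compare the attribute step and the predicate form side by side, here is a minimal sketch on a made-up HTML fragment. The markup, the URLs, and the variable names are assumptions for illustration, not the real geizhals.at page. It also shows that a path whose steps do not match the parsed markup returns an empty list, so indexing it with `[0]` raises `IndexError`, which may be how "element not found" shows up in the question's script.

```python
from lxml import html

# Made-up fragment for illustration; not the real page markup.
SAMPLE = (
    '<div id="product0">'
    '<a href="./example-product.html">'
    '<img src="thumb.jpg" big-image-url="https://example.com/big.jpg">'
    '</a></div>'
)
root = html.fromstring(SAMPLE)

# Attribute step: returns the attribute values as strings.
values = root.xpath('//*[@id="product0"]//img/@big-image-url')

# Predicate: returns the img elements that carry the attribute;
# the value is then read from each element.
elements = root.xpath('//*[@id="product0"]//img[@big-image-url]')
from_elements = [img.get('big-image-url') for img in elements]

# A path whose intermediate steps do not exist in this markup
# matches nothing, so the result is an empty list.
missing = root.xpath(
    '//*[@id="product0"]/div[1]/a/div/picture/img/@big-image-url'
)

print(values, from_elements, missing)
```

Searching with `//img` relative to the product container avoids depending on the exact chain of intermediate elements, which can differ between what the browser's inspector shows and the HTML that the script actually downloads.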
