Unable to retrieve the href attributes using Python and Selenium

Question

I'm new to this and have spent hours trying the various approaches I've read about here. I apologize if I've made a silly mistake.

I want to create a database of my LEGO sets, pulling images and information from brickset.com.

I am using:

anchors = driver.find_elements_by_xpath('//*[@id="ui-tabs-2"]/ul/li[1]/a')
anchors = [a.get_attribute('href') for a in anchors]

print(anchors) returns:

anchors = driver.find_elements_by_xpath('//*[@id="ui-tabs-2"]/ul/li[1]/a')

What I am targeting:

<div id="ui-tabs-2" class="ui-tabs-panel ui-widget-content ui-corner-bottom" aria-live="polite" aria-labelledby="ui-id-4" role="tabpanel" aria-expanded="true" aria-hidden="false" style="display: block;">
<ul class="moreimages">
<li>
<a href="https://images.brickset.com/sets/AdditionalImages/21054-1/21054_alt10.jpg" class="highslide plain " onclick="return hs.expand(this)">
<img src="https://images.brickset.com/sets/AdditionalImages/21054-1/tn_21054_alt10_jpg.jpg" title="" onerror="this.src='/assets/images/spacer2.png'" loading="lazy">
</a><div class="highslide-caption">

I am struggling to figure out how to solve this.

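Update: I am still not getting the href attributes. To add more detail, I am trying to get the images under the "Images" tab at this URL: https://brickset.com/sets/21330-1/Home-Alone Here is the code in question: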

anchors = driver.find_elements(By.XPATH, '//*[@id="ui-tabs-2"]/ul/li/a')
links = [anchors.get_attribute('href') for a in anchors]
print('Found ' + str(len(anchors)) + ' links to images')

I also tried:

#anchors = driver.find_elements_by_css_selector("a[href*='21330']")

This returned only one href, although there should be about a dozen.

Thanks, everyone, for your help!

Answer 1

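First of all, driver.find_elements_by_xpath is deprecated; use driver.find_elements(By.XPATH, 'locator') instead (or driver.find_element(By.XPATH, 'locator') when you want a single element).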

Now, if you want to get all of the href links in that panel:

from selenium.webdriver.common.by import By
elements = driver.find_elements(By.XPATH, '//*[@id="ui-tabs-2"]/ul/li/a')  # plural: returns a list
links = [element.get_attribute('href') for element in elements]

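Note that I did not use [1], which restricts the match to the first li; without it, the XPath matches the anchors in every list item.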
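The effect of the [1] predicate can be checked without a browser. The following is a minimal sketch using the standard library's xml.etree.ElementTree, which supports a subset of XPath; the markup is a simplified, made-up stand-in for the tab panel (the example.com URLs are placeholders, not real Brickset images).

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the tab panel; the URLs are placeholders.
PANEL = """
<body>
  <div id="ui-tabs-2">
    <ul class="moreimages">
      <li><a href="https://example.com/img_alt1.jpg">one</a></li>
      <li><a href="https://example.com/img_alt2.jpg">two</a></li>
      <li><a href="https://example.com/img_alt3.jpg">three</a></li>
    </ul>
  </div>
</body>
"""

root = ET.fromstring(PANEL)

# li[1] keeps only the first <li>, so only one anchor is matched.
first_only = root.findall(".//*[@id='ui-tabs-2']/ul/li[1]/a")
# Without the positional predicate, the anchors of every <li> are matched.
all_items = root.findall(".//*[@id='ui-tabs-2']/ul/li/a")

first_hrefs = [a.get("href") for a in first_only]
all_hrefs = [a.get("href") for a in all_items]
print(len(first_hrefs), len(all_hrefs))
```

The same path expressions behave the same way when passed to Selenium's XPath locator, which is why dropping [1] is needed to collect more than one link.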

Answer 2

You shouldn't reuse the same name for different variables.

According to the first line of code:

anchors = driver.find_elements_by_xpath('//*[@id="ui-tabs-2"]/ul/li[1]/a')

anchors is a list of WebElements. Ideally, to build another list from the href attributes, you should use a different name, e.g. hrefs.
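As a rough illustration of why keeping the names apart helps, here is a small sketch that runs without a browser. FakeAnchor is a hypothetical stand-in for a Selenium WebElement, and the URLs are placeholders. It contrasts calling get_attribute on each element with calling it on the list itself, which is what the comprehension in the question's update does.

```python
class FakeAnchor:
    """Hypothetical stand-in for a Selenium WebElement (illustration only)."""

    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        # Only the href attribute is modelled here.
        return self._href if name == "href" else None


anchors = [
    FakeAnchor("https://example.com/a.jpg"),
    FakeAnchor("https://example.com/b.jpg"),
]

# Correct: call get_attribute on each element `a`, and store the result
# under a different name than the element list.
hrefs = [a.get_attribute("href") for a in anchors]

# Buggy variant: `anchors` is a list, and lists have no get_attribute method.
try:
    [anchors.get_attribute("href") for a in anchors]
    error = None
except AttributeError as exc:
    error = str(exc)

print(hrefs)
print(error)
```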

Effectively, your code block would be:

anchors = driver.find_elements_by_xpath('//*[@id="ui-tabs-2"]/ul/li[1]/a')
hrefs = [a.get_attribute('href') for a in anchors]
print(hrefs)

Or, in a single line:

print([a.get_attribute('href') for a in driver.find_elements_by_xpath('//*[@id="ui-tabs-2"]/ul/li[1]/a')])
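One detail worth checking in a one-liner like this: a comprehension passed to print without square brackets is a generator expression, so print shows the generator object rather than the values. A minimal sketch with placeholder URLs:

```python
hrefs = ["https://example.com/a.jpg", "https://example.com/b.jpg"]

# Without brackets, the argument is a generator expression, not a list.
as_generator = repr(h for h in hrefs)
# With brackets, the list is built first and its contents are shown.
as_list = repr([h for h in hrefs])

print(as_generator)
print(as_list)
```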

Answer 3

You might want to try this.

Note: I am not using Selenium here.

import time

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:99.0) Gecko/20100101 Firefox/99.0",
}

sample_urls = [
    "https://brickset.com/sets/21330-1/Home-Alone",
    "https://brickset.com/sets/21101-1/Hayabusa"
]

with requests.Session() as s:
    for sample_url in sample_urls:
        ajax_setID = [
            a["href"] for a in
            BeautifulSoup(s.get(sample_url, headers=headers).text, "lxml").find_all("a", href=True)
            if "mainImage" in a["href"]
        ][0]
        image_url = f"https://brickset.com{ajax_setID}&_{int(time.time() * 1000)}"
        headers.update(
            {
                "Referer": sample_url,
                "X-Requested-With": "XMLHttpRequest",
            }
        )
        source_image = (
            BeautifulSoup(
                s.get(image_url, headers=headers).text, "lxml"
            ).find("img")["src"]
        )
        print(f"{sample_url.split('/', -1)[-1]} -> {source_image}")

This should output:

Home-Alone -> https://images.brickset.com/sets/images/21330-1.jpg?202109060933
Hayabusa -> https://images.brickset.com/sets/images/21101-1.jpg?201201150457
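The first step of the script above, picking out the anchor whose href contains "mainImage", can be sketched offline with the standard library's html.parser. The sample markup and the paths in it are made up for illustration and are not Brickset's real URLs. Anchors without an href are skipped so that a missing attribute does not raise an error.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href of every <a> tag that actually has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors such as <a name="top">
                self.hrefs.append(href)


# Made-up sample page; the paths are placeholders.
SAMPLE = """
<a name="top">no href here</a>
<a href="/sets/ajax/mainImage?set=0000-1">main image</a>
<a href="/sets/other">other</a>
"""

collector = HrefCollector()
collector.feed(SAMPLE)

# Same substring filter as in the script above.
main_image_links = [h for h in collector.hrefs if "mainImage" in h]
print(main_image_links)
```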


selenium

list-comprehension

href

web-scraping

python-3.x