在网页上查找 href

Question

我不明白为什么以下内容不起作用 - 我正在寻找并尝试单击此特定 link:

<a href="#/documents/2077">

来自URL：https://species-registry.canada.ca/index-en.html#/documents?documentTypeId=18&sortBy=documentTypeSort&sortDirection=asc&pageSize=10&keywords=Victoria%27s%20Owl-clover

从 URL 的起点开始，我尝试了一些方法，包括以下内容：

尝试 #1

WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT,"COSEWIC-Assessment-and-status-report")))

和

appraisal_html = driver.find_element_by_partial_link_text("COSEWIC-Assessment-and-status-report")

尝试#2

soup = bs(req.text,'html.parser')
for link in soup.find_all('a'):`
print(link.get('href'))`

等等。请记住，这是一个广义搜索，因为每次我进行此搜索时物种名称都会改变，其他所有内容都应该保持相似。

第二次尝试直接从漂亮的 soup 文档中找到了一大堆 links，比如菜单选项卡下的那些，但不是我正在寻找的 href。

第一次尝试由于某种原因超时，没有找到我输入的部分文本。也许这是因为那是页面上的文本而不是 href 本身？

我没有想到的一个解决方案是首先查找在其中找到 link 的边界框，然后在新的较小搜索区域内查找 link，但我仍然没有不知道为什么整页都找不到合适的link

Answer 1

这里有几件事：

COSEWIC-Assessment-and-status-report 不完全是 text，但它是 COSEWIC Assessment and Status Report on the Victoria’s Owl-clover

文本不在 A 标签内，而是在 SPAN:

内

<span data-v-7ee3c58f="" class="name-primary">COSEWIC Assessment and Status Report on the Victoria’s Owl-clover <em>Castilleja victoriae</em> in Canada</span>

因此，要识别 可点击的 元素，您需要引入 WebDriverWait for the and you can use either of the following :

使用 XPATH:

driver.get("https://species-registry.canada.ca/index-en.html#/documents?documentTypeId=18&sortBy=documentTypeSort&sortDirection=asc&pageSize=10&keywords=Victoria%27s%20Owl-clover")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH,"//span[contains(., 'COSEWIC Assessment and Status Report on the Victoria’s Owl-clover')]"))).click()

注意：您必须添加以下导入：

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Answer 2

试试这个：

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time


chrome_options = Options()
#chrome_options.add_argument("--headless")
#chrome_options.add_argument("user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36")


driver = webdriver.Chrome(executable_path="./chromedriver", options=chrome_options)

driver.get("https://species-registry.canada.ca/index-en.html#/documents?documentTypeId=18&sortBy=documentTypeSort&sortDirection=asc&pageSize=10&keywords=Victoria%27s%20Owl-clover")
time.sleep(2)

driver.find_element_by_xpath("//a[@class='card-header']").click()

Answer 3

import requests
from pprint import pp
headers = {
    "api-key": "3A1E8E87503C069448999238ABD05EE9"
}

params = {
    'api-version': '2017-11-11'
}


def main(url):
    with requests.Session() as req:
        req.headers.update(headers)
        req.params = params
        data = {
            "count": 'true',
            "filter": "((documentTypeId eq 18))",
            "orderby": "documentTypeSort asc,sortDate desc,documentCreateDate asc,documentTitleSort asc",
            "queryType": "full",
            "search": "/.*Victoria's.*/ /.*Owl-clover.*/",
            "searchMode": "all",
            "select": "id,consultationEndDate,consultationStartDate,consultationActivationStatusId,documentCreateDate,documentDescription,documentTitle,documentTypeId,species,attachments,contacts,links,finalOrDelayed",
            "skip": 0,
            "top": 10
        }
        r = req.post(url, json=data)
        ndata = {
            'filter': f"id eq '{r.json()['value'][0]['id']}'"
        }
        r = req.post(url, json=ndata)
        pp(r.json())


main('https://ecprccsarsrch.search.windows.net/indexes/docblobidxen/docs/search')

输出：

{'@odata.context': "https://ecprccsarsrch.search.windows.net/indexes('docblobidxen')/$metadata#docs(*)",
 'value': [{'@search.score': 1.0,
            'id': '2077',
            'documentTitle': 'COSEWIC Assessment and Status Report on the '
                             'Victoria’s Owl-clover <em>Castilleja '
                             'victoriae</em> in Canada',
            'documentCreateDate': '2010-09-01T13:54:36.8Z',
            'documentDescription': 'Victoria’s Owl-clover (<em>Castilleja '
                                   'victoriae</em>) is a newly described '
                                   'species, previously misidentified as  '
                                   '(<em>C. ambigua</em> ssp. '
                                   '<em>ambigua</em>). It is a small herb of '
                                   'the broomrape family with alternate,  '
                                   'hairy, lobed stem leaves and no basal '
                                   'rosette. The wider and more deeply lobed  '
                                   'upper leaves grade into the floral bracts. '
                                   'The sepals are fused into a  five-lobed '
                                   'calyx, and the petals are fused into a '
                                   '2-lipped flower 10-18 mm  long. The lower '
                                   'lip is lemon-yellow with minute white tips '
                                   'on each of the three  lobes. The upper lip '
                                   'is slightly longer than the lower lip and '
                                   'creamy white. The  fruits are brown, '
                                   '2-celled capsules that split at the tip '
                                   'when the seeds are  ripe. Each capsule '
                                   'bears 30-70 brown seeds with a sculptured '
                                   'seed coat.',
            'documentTypeId': 18,
            'consultationStartDate': None,
            'consultationEndDate': None,
            'consultationActivationStatusId': 0,
            'finalOrDelayed': 6,
            'attachments': ['{"attachmentId":"8142","attachmentTitle":"COSEWIC '
                            'Assessment and Status Report on the Victoria’s '
                            'Owl-clover <em>Castilleja victoriae</em> in '
                            'Canada","attachmentPublicationDate":"2010-09-03T00:00:00","file":"/cosewic/sr_Victoria\'s '
                            'Owl-clover_0810_e.pdf","html":"https://www.canada.ca/en/environment-climate-change/services/species-risk-public-registry/cosewic-assessments-status-reports/victoria-owl-clover-2010.html"}'],
            'contacts': ['{"salutation":"None","title":"","id":33,"firstName":"","lastName":"","organization":"COSEWIC '
                         'Secretariat","address":"c/o Canadian Wildlife '
                         'Service\r\n Environment '
                         'Canada","postalCode":"K1A0H3","city":"Ottawa","province":"ON","phone":"8199384125","email":"cosewic-cosepac@ec.gc.ca","fax":"8199383984"}'],
            'links': [],
            'species': ['1084-749']}]}

Answer 4

我将 selenium 与 bs4 一起使用。你要抓取的url是亲戚，我也转成绝对的urls.You可以从取消注释的部分获取绝对url

PS：您只需要安装管理器：pip install webdriver-manager 和运行脚本。

脚本：

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time


url = 'https://species-registry.canada.ca/index-en.html#/documents?sortBy=documentTypeSort&sortDirection=asc&currentPage=1&pageSize=10'

cm = ChromeDriverManager().install()
driver = webdriver.Chrome(cm)

driver.maximize_window()
time.sleep(8)
driver.get(url)
time.sleep(5)

base_url = 'https://species-registry.canada.ca/index-en.html'
soup = BeautifulSoup(driver.page_source, 'html.parser')
hrefs=soup.find_all('a',class_='card-header')

for href in hrefs:
    relative_url= href['href']
    print(relative_url)
    #abs_url= base_url + href['href']
    #print(abs_url)

作为亲戚输出：

#/documents/2968
#/documents/3002
#/documents/1590
#/documents/3332
#/documents/3354
#/documents/3357
#/documents/1451
#/documents/3325
#/documents/3333
#/documents/205

作为绝对 URL 输出：

https://species-registry.canada.ca/index-en.html#/documents/2968
https://species-registry.canada.ca/index-en.html#/documents/3002
https://species-registry.canada.ca/index-en.html#/documents/1590
https://species-registry.canada.ca/index-en.html#/documents/3332
https://species-registry.canada.ca/index-en.html#/documents/3354
https://species-registry.canada.ca/index-en.html#/documents/3357
https://species-registry.canada.ca/index-en.html#/documents/1451
https://species-registry.canada.ca/index-en.html#/documents/3325
https://species-registry.canada.ca/index-en.html#/documents/3333
https://species-registry.canada.ca/index-en.html#/documents/205

在网页上查找 href

Find href on web page

html

selenium

beautifulsoup

href

webdriverwait

脚本：

作为亲戚输出：

作为绝对 URL 输出：