Find href on web page
I don't understand why the following isn't working. I'm trying to find and click this specific link:
<a href="#/documents/2077">
Starting from the URL, I have tried a few things, including the following:
Attempt #1
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT,"COSEWIC-Assessment-and-status-report")))
and
appraisal_html = driver.find_element_by_partial_link_text("COSEWIC-Assessment-and-status-report")
Attempt #2
from bs4 import BeautifulSoup as bs
# req is the requests response for the search page URL
soup = bs(req.text, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
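And so on. Keep in mind that this is a generalized search: the species name changes every time I run it, while everything else should stay about the same.
The second attempt found a whole bunch of links straight from the Beautiful Soup document, such as the ones under the menu tabs, but not the href I'm looking for.
The first attempt times out for some reason without finding the partial text I entered. Maybe that's because it is text on the page rather than the href itself?
One solution I haven't tried yet is to first find the bounding box that contains the link and then look for the link inside that smaller search area, but I still don't understand why the right link can't be found in the whole page.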
A couple of things here:
COSEWIC-Assessment-and-status-report isn't exactly the text; the text is COSEWIC Assessment and Status Report on the Victoria’s Owl-clover
The text isn't within the A tag but within a SPAN:
<span data-v-7ee3c58f="" class="name-primary">COSEWIC Assessment and Status Report on the Victoria’s Owl-clover <em>Castilleja victoriae</em> in Canada</span>
So, to identify the clickable element you need to induce WebDriverWait for the element_to_be_clickable() condition, and you can use the following locator strategy:
Using XPATH:
driver.get("https://species-registry.canada.ca/index-en.html#/documents?documentTypeId=18&sortBy=documentTypeSort&sortDirection=asc&pageSize=10&keywords=Victoria%27s%20Owl-clover")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH,"//span[contains(., 'COSEWIC Assessment and Status Report on the Victoria’s Owl-clover')]"))).click()
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
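Because the species name changes on every search, the XPath above could be generated rather than typed by hand. The following is a minimal sketch, not part of the original answer: `xpath_literal` and `report_span_xpath` are hypothetical helper names, and the title prefix is assumed to stay the same for every species. XPath 1.0 has no escape character inside string literals, so names that contain quotes need the `concat()` workaround shown here.

```python
def xpath_literal(text: str) -> str:
    """Quote text as an XPath 1.0 string literal (hypothetical helper)."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # Both quote types are present: split on ' and rejoin the pieces with concat()
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join(f"'{p}'" for p in parts) + ")"


def report_span_xpath(species_name: str) -> str:
    """Build the span locator from the answer above for any species name."""
    # Assumption: every report title starts with this fixed prefix
    title = f"COSEWIC Assessment and Status Report on the {species_name}"
    return f"//span[contains(., {xpath_literal(title)})]"


if __name__ == "__main__":
    # The page text uses a typographic apostrophe (’), not a straight one (')
    print(report_span_xpath("Victoria’s Owl-clover"))
```

The resulting string can be passed to `(By.XPATH, ...)` in the wait shown above. Note that the typographic apostrophe has to match what the page actually renders.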
Try this:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
chrome_options = Options()
#chrome_options.add_argument("--headless")
#chrome_options.add_argument("user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36")
driver = webdriver.Chrome(executable_path="./chromedriver", options=chrome_options)
driver.get("https://species-registry.canada.ca/index-en.html#/documents?documentTypeId=18&sortBy=documentTypeSort&sortDirection=asc&pageSize=10&keywords=Victoria%27s%20Owl-clover")
time.sleep(2)
driver.find_element_by_xpath("//a[@class='card-header']").click()
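The search URL passed to `driver.get()` above is hard-coded for one species. As a rough sketch (the `search_url` helper and its parameter names are my own, not from the answer), the same URL could be assembled from the species name with the standard library, assuming the query parameters stay as they appear in the URL above:

```python
from urllib.parse import quote, urlencode

# Hash-routed search page, copied from the URL used above
BASE = "https://species-registry.canada.ca/index-en.html#/documents"


def search_url(species_name: str, document_type_id: int = 18, page_size: int = 10) -> str:
    """Build the registry search URL for a species name (hypothetical helper)."""
    params = {
        "documentTypeId": document_type_id,
        "sortBy": "documentTypeSort",
        "sortDirection": "asc",
        "pageSize": page_size,
        "keywords": species_name,
    }
    # quote_via=quote encodes spaces as %20 (not +) and the apostrophe as %27
    return f"{BASE}?{urlencode(params, quote_via=quote)}"


if __name__ == "__main__":
    print(search_url("Victoria's Owl-clover"))
```

For "Victoria's Owl-clover" this reproduces the URL used in the snippet above, so `driver.get(search_url(name))` could replace the literal string.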
import requests
from pprint import pp
headers = {
    "api-key": "3A1E8E87503C069448999238ABD05EE9"
}
params = {
    'api-version': '2017-11-11'
}
def main(url):
    with requests.Session() as req:
        req.headers.update(headers)
        req.params = params
        data = {
            "count": 'true',
            "filter": "((documentTypeId eq 18))",
            "orderby": "documentTypeSort asc,sortDate desc,documentCreateDate asc,documentTitleSort asc",
            "queryType": "full",
            "search": "/.*Victoria's.*/ /.*Owl-clover.*/",
            "searchMode": "all",
            "select": "id,consultationEndDate,consultationStartDate,consultationActivationStatusId,documentCreateDate,documentDescription,documentTitle,documentTypeId,species,attachments,contacts,links,finalOrDelayed",
            "skip": 0,
            "top": 10
        }
        r = req.post(url, json=data)
        ndata = {
            'filter': f"id eq '{r.json()['value'][0]['id']}'"
        }
        r = req.post(url, json=ndata)
        pp(r.json())
main('https://ecprccsarsrch.search.windows.net/indexes/docblobidxen/docs/search')
Output:
{'@odata.context': "https://ecprccsarsrch.search.windows.net/indexes('docblobidxen')/$metadata#docs(*)",
'value': [{'@search.score': 1.0,
'id': '2077',
'documentTitle': 'COSEWIC Assessment and Status Report on the '
'Victoria’s Owl-clover <em>Castilleja '
'victoriae</em> in Canada',
'documentCreateDate': '2010-09-01T13:54:36.8Z',
'documentDescription': 'Victoria’s Owl-clover (<em>Castilleja '
'victoriae</em>) is a newly described '
'species, previously misidentified as '
'(<em>C. ambigua</em> ssp. '
'<em>ambigua</em>). It is a small herb of '
'the broomrape family with alternate, '
'hairy, lobed stem leaves and no basal '
'rosette. The wider and more deeply lobed '
'upper leaves grade into the floral bracts. '
'The sepals are fused into a five-lobed '
'calyx, and the petals are fused into a '
'2-lipped flower 10-18 mm long. The lower '
'lip is lemon-yellow with minute white tips '
'on each of the three lobes. The upper lip '
'is slightly longer than the lower lip and '
'creamy white. The fruits are brown, '
'2-celled capsules that split at the tip '
'when the seeds are ripe. Each capsule '
'bears 30-70 brown seeds with a sculptured '
'seed coat.',
'documentTypeId': 18,
'consultationStartDate': None,
'consultationEndDate': None,
'consultationActivationStatusId': 0,
'finalOrDelayed': 6,
'attachments': ['{"attachmentId":"8142","attachmentTitle":"COSEWIC '
'Assessment and Status Report on the Victoria’s '
'Owl-clover <em>Castilleja victoriae</em> in '
'Canada","attachmentPublicationDate":"2010-09-03T00:00:00","file":"/cosewic/sr_Victoria\'s '
'Owl-clover_0810_e.pdf","html":"https://www.canada.ca/en/environment-climate-change/services/species-risk-public-registry/cosewic-assessments-status-reports/victoria-owl-clover-2010.html"}'],
'contacts': ['{"salutation":"None","title":"","id":33,"firstName":"","lastName":"","organization":"COSEWIC '
'Secretariat","address":"c/o Canadian Wildlife '
'Service\r\n Environment '
'Canada","postalCode":"K1A0H3","city":"Ottawa","province":"ON","phone":"8199384125","email":"cosewic-cosepac@ec.gc.ca","fax":"8199383984"}'],
'links': [],
'species': ['1084-749']}]}
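Note that each entry in `attachments` in the output above is itself a JSON string, so it needs a second decoding step. The sketch below is an illustration only: `summarize_document` is a hypothetical helper, the sample record is trimmed from the output above, and the idea that the `id` maps onto the `#/documents/<id>` href is an assumption based on the id 2077 matching the link in the question.

```python
import json

# Assumption: document pages live at <registry page>#/documents/<id>
REGISTRY_PAGE = "https://species-registry.canada.ca/index-en.html"


def summarize_document(doc: dict) -> dict:
    """Pull the page URL and attachment links out of one search result."""
    # Each attachment is a JSON-encoded string, so decode it separately
    attachments = [json.loads(raw) for raw in doc.get("attachments", [])]
    return {
        "page_url": f"{REGISTRY_PAGE}#/documents/{doc['id']}",
        "html_urls": [a["html"] for a in attachments if a.get("html")],
    }


# Record trimmed from the output above (only the fields used here)
sample = {
    "id": "2077",
    "attachments": [
        '{"attachmentId":"8142","file":"/cosewic/sr_Victoria\'s Owl-clover_0810_e.pdf",'
        '"html":"https://www.canada.ca/en/environment-climate-change/services/'
        'species-risk-public-registry/cosewic-assessments-status-reports/'
        'victoria-owl-clover-2010.html"}'
    ],
}

if __name__ == "__main__":
    print(summarize_document(sample))
```

In the script above this could be called as `summarize_document(r.json()['value'][0])` after the second request.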
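I use selenium with bs4. The URLs you want to scrape are relative, so I have also converted them to absolute URLs. You can get the absolute URLs by uncommenting the commented-out lines.
PS: You just need to install the manager with pip install webdriver-manager and run the script.
Script: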
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time
url = 'https://species-registry.canada.ca/index-en.html#/documents?sortBy=documentTypeSort&sortDirection=asc&currentPage=1&pageSize=10'
# Download a matching chromedriver and pass its path to Selenium via Service
cm = ChromeDriverManager().install()
driver = webdriver.Chrome(service=Service(cm))
driver.maximize_window()
time.sleep(8)
driver.get(url)
# Wait for the result cards to load before reading the page source
time.sleep(5)
base_url = 'https://species-registry.canada.ca/index-en.html'
soup = BeautifulSoup(driver.page_source, 'html.parser')
hrefs = soup.find_all('a', class_='card-header')
for href in hrefs:
    relative_url = href['href']
    print(relative_url)
    # abs_url = base_url + href['href']
    # print(abs_url)
Output as relative URLs:
#/documents/2968
#/documents/3002
#/documents/1590
#/documents/3332
#/documents/3354
#/documents/3357
#/documents/1451
#/documents/3325
#/documents/3333
#/documents/205
Output as absolute URLs:
https://species-registry.canada.ca/index-en.html#/documents/2968
https://species-registry.canada.ca/index-en.html#/documents/3002
https://species-registry.canada.ca/index-en.html#/documents/1590
https://species-registry.canada.ca/index-en.html#/documents/3332
https://species-registry.canada.ca/index-en.html#/documents/3354
https://species-registry.canada.ca/index-en.html#/documents/3357
https://species-registry.canada.ca/index-en.html#/documents/1451
https://species-registry.canada.ca/index-en.html#/documents/3325
https://species-registry.canada.ca/index-en.html#/documents/3333
https://species-registry.canada.ca/index-en.html#/documents/205
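As an alternative to concatenating `base_url` and the href by hand, the standard library's `urljoin` resolves fragment-only links like these against the page URL. This is only a small sketch; `to_absolute` is a hypothetical helper name, not something from the script above.

```python
from urllib.parse import urljoin

BASE_URL = "https://species-registry.canada.ca/index-en.html"


def to_absolute(hrefs, base_url=BASE_URL):
    """Resolve relative hrefs (including fragment-only ones) against base_url."""
    return [urljoin(base_url, href) for href in hrefs]


if __name__ == "__main__":
    for absolute in to_absolute(["#/documents/2968", "#/documents/205"]):
        print(absolute)
```

Unlike plain string concatenation, `urljoin` also copes with hrefs that are already absolute or that start with a path instead of a fragment.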