Selenium XPath: Trying to get the original URL from an archived link
I am working on a project that scrapes articles from an archive website. Below is an example of an archive URL and its original URL. I have the archive URL, and I want to use Selenium to extract the original URL.
Archive URL: https://archive.is/xXAoL
Original URL: https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU
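from selenium import webdriver

# Archived page whose original URL should be recovered
url = "https://archive.is/xXAoL"
# Path to a local chromedriver binary (Selenium 3 style; Selenium 4 expects a Service object instead)
driver = webdriver.Chrome('./chromedriver')
driver.get(url)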
Any suggestions on how to get the original URL?
Method 1
One thing that may be useful is that the canonical link is
https://archive.is/2021.09.07-145059/https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU
I could strip everything up to the second https. However, that approach is not working, so I am looking for another method that does not depend on the meta data.
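The idea of cutting the canonical link at the second https can be sketched in plain Python. This is only an illustration under assumptions: the helper name original_from_canonical is hypothetical, and the canonical link is assumed to be already available as a string in the form shown above (reading it from the page is not covered here).

```python
def original_from_canonical(canonical_url: str) -> str:
    """Return the embedded URL that follows the archive prefix.

    Hypothetical helper: assumes the canonical link has the form
    <archive scheme>://<archive host>/<timestamp>/<original URL>.
    """
    # Skip past the archive's own scheme so it is not matched again.
    start = canonical_url.find("://") + len("://")
    # Look for the next scheme marker; the original may be http or https.
    positions = [canonical_url.find(s, start) for s in ("https://", "http://")]
    positions = [p for p in positions if p != -1]
    if not positions:
        raise ValueError("no embedded URL found in canonical link")
    # Everything from the earliest embedded scheme onward is the original URL.
    return canonical_url[min(positions):]


canonical = (
    "https://archive.is/2021.09.07-145059/"
    "https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-"
    "vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html"
    "?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU"
)
print(original_from_canonical(canonical))
```

For the canonical link quoted above, this prints the original URL given in the question. It only handles the string; it does not address why reading the canonical link from the page failed.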
To extract the original URL, you need to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR:
driver.get('https://archive.is/xXAoL')
# Wait up to 20 s for the visible input named "q", then print its value attribute
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[name='q'][value]"))).get_attribute("value"))
Using XPATH:
driver.get('https://archive.is/xXAoL')
# Same wait, locating the input with an XPath expression instead
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//input[@name='q'][@value]"))).get_attribute("value"))
Console output:
https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC