Selenium XPath: trying to get the original URL from an archived link

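I am working on a project that scrapes articles from an archive site. Below is an example of an archived URL and the corresponding original URL. I only have the archived URL, and I want to use Selenium to extract the original URL.

Archived URL: https://archive.is/xXAoL

Original URL: https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU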

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

url = "https://archive.is/xXAoL"
driver = webdriver.Chrome('./chromedriver')
driver.get(url)

Any suggestions on how to get the original URL?

Method 1

One thing that might be useful is that the canonical link is:

https://archive.is/2021.09.07-145059/https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU

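I could strip everything before the second https. However, that approach does not work, so I am looking for another method that does not rely on the meta tag.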
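For reference, the string manipulation described above is simple on its own. Here is a minimal sketch, assuming the canonical link has already been read into a string (how it is obtained is not shown here). The helper name `original_from_canonical` is hypothetical and used only for illustration.

```python
def original_from_canonical(canonical: str) -> str:
    """Return the part of the canonical link starting at the second 'http'."""
    # Start searching at index 1 so the archive URL's own scheme is skipped.
    start = canonical.find("http", 1)
    if start == -1:
        raise ValueError("no embedded URL found in canonical link")
    return canonical[start:]


# Canonical link quoted in the question.
canonical = (
    "https://archive.is/2021.09.07-145059/"
    "https://beforeitsnews.com/eu/2021/08/"
    "breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-"
    "and-no-longer-recommended-2676130.html"
    "?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU"
)
print(original_from_canonical(canonical))
```

This only works when the canonical link really has the `archive prefix + original URL` shape shown above. A short link such as https://archive.is/xXAoL contains no embedded URL, so the helper raises `ValueError` for it.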

To extract the original URL, you need to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following locator strategies:

  • Using CSS_SELECTOR:

    driver.get('https://archive.is/xXAoL')
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[name='q'][value]"))).get_attribute("value"))
    
  • Using XPATH:

    driver.get('https://archive.is/xXAoL')
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//input[@name='q'][@value]"))).get_attribute("value"))
    
  • Console output:

    https://beforeitsnews.com/eu/2021/08/breaking-germany-halts-all-covid-19-vaccines-says-they-are-unsafe-and-no-longer-recommended-2676130.html?fbclid=IwAR3JPcxNHlZ5eQHLyO2teh6_xcrerisBrCNeleOZz7qmxI7_pDJDBlEAIjU
    
  • Note: you have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC