Cannot extract/load all hrefs from an iframe (inside an HTML page) while parsing a webpage
I have really been struggling with this case and have been working on it all day. I need your help. I am trying to scrape this webpage: https://decisions.scc-csc.ca/scc-csc/en/d/s/index.do?cont=&ref=&d1=2012-01-01&d2=2022-01-31&p=&col=1&su=16&or=
I want to get all 137 hrefs (137 documents).
The code I am using:
import time
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys

def test(self):
    final_url = 'https://decisions.scc-csc.ca/scc-csc/en/d/s/index.do?cont=&ref=&d1=2012-01-01&d2=2022-01-31&p=&col=1&su=16&or='
    self.driver.get(final_url)

    # locate the iframe that holds the search results and navigate to its source
    soup = BeautifulSoup(self.driver.page_source, 'html.parser')
    iframe = soup.find('iframe')
    src = iframe['src']
    base = 'https://decisions.scc-csc.ca/'
    main_url = urljoin(base, src)
    self.driver.get(main_url)

    # page down a fixed number of times to trigger loading of more results
    browser = self.driver
    elem = browser.find_element_by_tag_name("body")
    no_of_pagedowns = 20
    while no_of_pagedowns:
        elem.send_keys(Keys.PAGE_DOWN)
        time.sleep(0.2)
        no_of_pagedowns -= 1
The problem is that it only loads the first 25 documents (hrefs), and I don't know how to load the rest.
This code scrolls down until all the elements have loaded (the results inside the iframe are rendered lazily in batches as you scroll), then saves the URLs of the PDFs in the list pdfs. Note that everything is done with Selenium alone, without BeautifulSoup.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()  # add your own options here
driver = webdriver.Chrome(options=options, service=Service(your_chromedriver_path))
driver.get('https://decisions.scc-csc.ca/scc-csc/en/d/s/index.do?cont=&ref=&d1=2012-01-01&d2=2022-01-31&p=&col=1&su=16&or=')

# wait for the iframe to be loaded and then switch to it
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "decisia-iframe")))

# in this case number_of_results = 137
number_of_results = int(driver.find_element(By.XPATH, "//h2[contains(., 'result')]").text.split()[0])

pdfs = []
while len(pdfs) < number_of_results:
    pdfs = driver.find_elements(By.CSS_SELECTOR, 'a[title="Download the PDF version"]')
    # scroll the last visible row into view so the next batch of results loads
    driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', pdfs[-1])
    time.sleep(1)

pdfs = [pdf.get_attribute('href') for pdf in pdfs]
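
If you then want to save the files themselves, here is a minimal follow-up sketch using requests. The pdfs list comes from the code above; the output directory and the way the filename is derived from the URL are assumptions for illustration, not part of the answer above:

import os
import requests

os.makedirs('pdfs', exist_ok=True)  # assumed output directory
for url in pdfs:
    # derive a filename from the last path segment of the URL
    # (assumption; adjust for the site's actual URL scheme)
    name = url.split('/')[-1].split('?')[0] or 'document.pdf'
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(os.path.join('pdfs', name), 'wb') as f:
        f.write(response.content)

If the site only serves the PDFs within an authenticated or cookie-bound session, you may need to copy the cookies from the Selenium driver into a requests.Session first.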