Cannot extract/load all hrefs from an iframe (inside an HTML page) while parsing a webpage
I have really been struggling with this case and have been working on it all day. I need your help. I am trying to scrape this webpage: https://decisions.scc-csc.ca/scc-csc/en/d/s/index.do?cont=&ref=&d1=2012-01-01&d2=2022-01-31&p=&col=1&su=16&or=
I want to get all 137 hrefs (137 documents).
The code I am using:
import time
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys

def test(self):
    final_url = 'https://decisions.scc-csc.ca/scc-csc/en/d/s/index.do?cont=&ref=&d1=2012-01-01&d2=2022-01-31&p=&col=1&su=16&or='
    self.driver.get(final_url)

    # locate the iframe that holds the search results and navigate to its source
    soup = BeautifulSoup(self.driver.page_source, 'html.parser')
    iframe = soup.find('iframe')
    src = iframe['src']
    base = 'https://decisions.scc-csc.ca/'
    main_url = urljoin(base, src)
    self.driver.get(main_url)

    # page down a fixed number of times to trigger loading of more results
    browser = self.driver
    elem = browser.find_element_by_tag_name("body")
    no_of_pagedowns = 20
    while no_of_pagedowns:
        elem.send_keys(Keys.PAGE_DOWN)
        time.sleep(0.2)
        no_of_pagedowns -= 1
The problem is that it only loads the first 25 documents (hrefs), and I don't know how to load the rest.
This code scrolls down until all the elements have loaded (the results inside the iframe are rendered lazily in batches as you scroll), then saves the URLs of the PDFs in the list pdfs. Note that everything is done with Selenium alone, without BeautifulSoup.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()  # add your own options here
driver = webdriver.Chrome(options=options, service=Service(your_chromedriver_path))
driver.get('https://decisions.scc-csc.ca/scc-csc/en/d/s/index.do?cont=&ref=&d1=2012-01-01&d2=2022-01-31&p=&col=1&su=16&or=')

# wait for the iframe to be loaded and then switch to it
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "decisia-iframe")))

# in this case number_of_results = 137
number_of_results = int(driver.find_element(By.XPATH, "//h2[contains(., 'result')]").text.split()[0])

pdfs = []
while len(pdfs) < number_of_results:
    pdfs = driver.find_elements(By.CSS_SELECTOR, 'a[title="Download the PDF version"]')
    # scroll the last visible row into view so the next batch of results loads
    driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', pdfs[-1])
    time.sleep(1)

pdfs = [pdf.get_attribute('href') for pdf in pdfs]
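
If you then want to save the files themselves, here is a minimal follow-up sketch using requests. The pdfs list comes from the code above; the output directory and the way the filename is derived from the URL are assumptions for illustration, not part of the answer above:

import os
import requests

os.makedirs('pdfs', exist_ok=True)  # assumed output directory
for url in pdfs:
    # derive a filename from the last path segment of the URL
    # (assumption; adjust for the site's actual URL scheme)
    name = url.split('/')[-1].split('?')[0] or 'document.pdf'
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(os.path.join('pdfs', name), 'wb') as f:
        f.write(response.content)

If the site only serves the PDFs within an authenticated or cookie-bound session, you may need to copy the cookies from the Selenium driver into a requests.Session first.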