Web-Scraping JavaScript-Rendered Content using Selenium in Python

我对网络抓取非常陌生,一直在尝试使用 Selenium 的功能来模拟浏览器访问德克萨斯 public 合同网页,然后下载嵌入的 PDF。网站是这样的:http://www.txsmartbuy.com/sp

So far, I have successfully used Selenium to select an option in the "Agency Name" drop-down menu and click the search button. I have listed my Python code below.

import os
os.chdir("/Users/fsouza/Desktop") #Setting up directory

from bs4 import BeautifulSoup #Downloading pertinent Python packages
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

chromedriver = "/Users/fsouza/Desktop/chromedriver" #Setting up Chrome driver
driver = webdriver.Chrome(executable_path=chromedriver)
driver.get("http://www.txsmartbuy.com/sp")
delay = 3 #Seconds

WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.XPATH, "//select[@id='agency-name-filter']/option[69]")))    
health = driver.find_element_by_xpath("//select[@id='agency-name-filter']/option[68]")
health.click()
search = driver.find_element_by_id("spBtnSearch")
search.click()

Once I get to the results page, I am stuck.

First, I can't access any of the result links using the html page source. However, if I manually inspect the individual links in Chrome, I do find the relevant tags (<a href...) associated with the individual results. I'm guessing this is because of the JavaScript-rendered content.

Second, even if Selenium were able to see these individual tags, they have no class or id. The best way to call them, I figured, would be to call the <a tags in the order they are displayed (see the code below), but that didn't work either. Instead, the link calls some other 'visible' tag (something in the footer, which I don't need).

Third, assuming these things did work, how can I figure out the number of <a> tags displayed on the page (so that I can loop this code until every single result has been covered)?

driver.execute_script("document.getElementsByTagName('a')[27].click()")

I appreciate your attention to this, and please excuse any stupidity on my part given that I'm just getting started.

To get the <a> tags you want from the results, use the following XPath:

//tbody//tr//td//strong//a
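If you want to sanity-check offline what that relative XPath will match, you can approximate it with the standard library's HTML parser; the table markup below is a hypothetical stand-in for the real results page, not the actual site HTML:

```python
from html.parser import HTMLParser

class StrongAnchorCollector(HTMLParser):
    """Collects <a> tags nested inside <strong>, roughly what
    //tbody//tr//td//strong//a matches on this page."""
    def __init__(self):
        super().__init__()
        self._in_strong = 0
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._in_strong += 1
        elif tag == "a" and self._in_strong:
            self.anchors.append(dict(attrs))

    def handle_endtag(self, tag):
        if tag == "strong" and self._in_strong:
            self._in_strong -= 1

# Hypothetical fragment mimicking one result row
sample = ('<table><tbody><tr><td><strong>'
          '<a onclick=\'window.open("/sp/HHS0006862");return false;\'>IFB HHS0006862</a>'
          '</strong></td></tr></tbody></table>')
parser = StrongAnchorCollector()
parser.feed(sample)
print(len(parser.anchors))  # → 1
```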

After clicking the search button, you can extract the links in a loop. First, you need to wait for all of the elements to be located with .visibility_of_all_elements_located:

search.click()

elements = WebDriverWait(driver, 60).until(EC.visibility_of_all_elements_located((By.XPATH, "//tbody//tr//td//strong//a")))

print(len(elements))  # number of result links found on the page

for element in elements:
    get_text = element.text
    print(get_text)
    # the link's target is buried in its onclick handler, e.g. window.open("/sp/HHS0006862");return false;
    url_number = element.get_attribute('onclick').replace('window.open("/sp/', '').replace('");return false;', '')
    get_url = 'http://www.txsmartbuy.com/sp/' + url_number
    print(get_url)

One of the results:

IFB HHS0006862, Blanket, San Angelo Canteen Resale. 529-96596. http://www.txsmartbuy.com/sp/HHS0006862
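As a side note, the chained replace() calls above are tied to that exact handler string; a slightly more robust sketch using a regular expression (the helper name is mine, and it assumes the handler keeps the window.open("...") shape shown above):

```python
import re

def result_url(onclick, base='http://www.txsmartbuy.com'):
    """Pull the relative path out of an onclick handler such as
    'window.open("/sp/HHS0006862");return false;' and build the full URL."""
    match = re.search(r'window\.open\("([^"]+)"\)', onclick)
    return base + match.group(1) if match else None

print(result_url('window.open("/sp/HHS0006862");return false;'))
# → http://www.txsmartbuy.com/sp/HHS0006862
```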

To scrape the JavaScript-rendered content, you need to:

  • Induce WebDriverWait for the visibility of all the desired elements.

  • Open each link in a new tab using Ctrl and click() through ActionChains.

  • Induce WebDriverWait for the number of windows to be 2, then switch to the newly opened tab for the web scraping.

  • Switch back to the main page.

  • Code block:

      from selenium import webdriver
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.support import expected_conditions as EC
      from selenium.webdriver.common.action_chains import ActionChains
      from selenium.webdriver.common.keys import Keys
      import time
    
      options = webdriver.ChromeOptions() 
      options.add_argument("start-maximized")
      options.add_experimental_option("excludeSwitches", ["enable-automation"])
      options.add_experimental_option('useAutomationExtension', False)
      driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
      driver.get("http://www.txsmartbuy.com/sp")
      windows_before  = driver.current_window_handle
      WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//select[@id='agency-name-filter' and @name='agency-name']"))).click()
      WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//select[@id='agency-name-filter' and @name='agency-name']//option[contains(., 'Health & Human Services Commission - 529')]"))).click()
      WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@id='spBtnSearch']/i[@class='icon-search']"))).click()
      for link in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table/tbody//tr/td/strong/a"))):
          ActionChains(driver).key_down(Keys.CONTROL).click(link).key_up(Keys.CONTROL).perform()
          WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
          windows_after = driver.window_handles
          new_window = [x for x in windows_after if x != windows_before][0]
          driver.switch_to.window(new_window)
          time.sleep(3)
          print("Focus on the newly opened tab and here you can scrape the page")
          driver.close()
          driver.switch_to.window(windows_before)
      driver.quit()
    
  • Console output:

      Focus on the newly opened tab and here you can scrape the page
      Focus on the newly opened tab and here you can scrape the page
      Focus on the newly opened tab and here you can scrape the page
      .
      .
    
  • Browser snapshot:
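The window-handle bookkeeping inside the loop can be pulled out into a small helper; a sketch with plain strings standing in for real Selenium window handles:

```python
def newly_opened(main_handle, handles):
    """Return the first handle that is not the main window
    (mirrors the list comprehension used in the loop above)."""
    return next((h for h in handles if h != main_handle), None)

# Plain strings standing in for driver.window_handles values
print(newly_opened("CDwindow-AAA", ["CDwindow-AAA", "CDwindow-BBB"]))  # → CDwindow-BBB
```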


References

You can find a couple of relevant detailed discussions in:

  • StaleElementReferenceException even after adding the wait while collecting the data from the wikipedia using web-scraping