Why can't Selenium and the Firefox WebDriver scrape website tags loaded by AJAX?

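I want to get the text of some HTML tags from bonbast. Some of the elements are loaded by AJAX (for example, the tag with the id "ounce_top"). I have tried Selenium with geckodriver, but I still cannot scrape these tags, and the elements do not even appear on the page when the automated Firefox (geckodriver) opens it. I don't know why this happens. How can I scrape this website?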

Code trials:

from selenium import webdriver
from bs4 import BeautifulSoup

url_news = 'https://bonbast.com/'
driver = webdriver.Firefox()
driver.get(url_news)
html = driver.page_source
soup = BeautifulSoup(html)
a = driver.find_element_by_id(id_="ounce_top")

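The desired element is a dynamic element, so ideally, to extract the desired text, i.e. 1,817.43, you need to induce WebDriverWait for the visibility_of_element_located() and you can use either of the following locator strategies: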

  • Using CSS_SELECTOR:

    driver.get("https://bonbast.com/")
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.btn.btn-primary.btn-sm.acceptcookies"))).click()
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span#ounce_top"))).text)
    
  • Using XPATH:

    driver.get("https://bonbast.com/")
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.btn.btn-primary.btn-sm.acceptcookies"))).click()
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[@id='ounce_top']"))).text)
    
  • Console output:

    1,817.43
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
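
To see why the wait makes the difference, it can help to think of an explicit wait as a polling loop: the condition is checked repeatedly until it returns something truthy or the timeout expires. The sketch below is a simplified model of that idea in plain Python, not Selenium's actual implementation. `wait_until` and `FakePage` are hypothetical names used only for illustration, and `FakePage` merely stands in for a page whose element is filled in by a later AJAX response.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it is truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakePage:
    """Hypothetical stand-in for a page whose element is filled in by AJAX."""

    def __init__(self, ready_after_calls):
        self._calls = 0
        self._ready_after_calls = ready_after_calls

    def ounce_top_text(self):
        self._calls += 1
        # Empty until the simulated AJAX response has "arrived".
        return "1,817.43" if self._calls >= self._ready_after_calls else ""


page = FakePage(ready_after_calls=3)
immediate = page.ounce_top_text()  # read right after "loading": still empty
value = wait_until(page.ounce_top_text, timeout=2.0, poll_interval=0.01)
print(repr(immediate))
print(value)
```

Reading the element immediately returns an empty string, which mirrors what happens when `page_source` or a plain `find_element` call runs before the AJAX response arrives. The polled read returns the value once it is available.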
    

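To do this with Selenium you need to add a wait/delay. Explicit waits with expected conditions are the best approach.
I guess you want to get the text value inside that element?
This should work: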

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url_news = 'https://bonbast.com/'
driver = webdriver.Firefox()
wait = WebDriverWait(driver, 20)
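driver.get(url_news)
# Wait for the AJAX-loaded element before reading its text or the page source
your_gold_value = wait.until(EC.visibility_of_element_located((By.ID, "ounce_top"))).text
# Grab the page source only after the wait so it includes the loaded element
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")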
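
The `.text` value returned above is a string such as `1,817.43`. If a numeric value is needed, a small helper can strip the thousands separator before converting. This is only a sketch: `parse_price` is a hypothetical name, and it assumes the site keeps using a comma as the thousands separator and a dot as the decimal point.

```python
from decimal import Decimal


def parse_price(text):
    """Convert a string like '1,817.43' into a Decimal.

    Assumes ',' is a thousands separator and '.' is the decimal point.
    """
    cleaned = text.strip().replace(",", "")
    return Decimal(cleaned)


price = parse_price("1,817.43")
print(price)
```

`Decimal` is used here instead of `float` so the quoted price is kept exactly as displayed.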