Scraping a Dynamic Table Using Selenium WebDriverWait Returns a Truncated DataFrame

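I am trying to scrape a dynamic table named "Holdings" from https://www.ishares.com/us/products/268752/ishares-global-reit-etf

At first I used Selenium, but I got an empty DataFrame. The community here then suggested that I induce "WebDriverWait" so the data is fully loaded before I extract it. That worked, but the data I get is truncated from about 400 rows to only 10. How can I get all the data I need? Any help would be appreciated. Thanks.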

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

# Instantiate options
options = webdriver.ChromeOptions()
options.headless = True

# Instantiate a webdriver
site = 'https://www.ishares.com/us/products/268752/ishares-global-reit-etf'
wd = webdriver.Chrome('chromedriver', options=options)
wd.get(site)

# Induce WebDriver Wait
WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
wd.execute_script("arguments[0].scrollIntoView();", WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@data-componentname]/h2[normalize-space()='Holdings']"))))
data = WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@aria-describedby='allHoldingsTable_info']"))).get_attribute("outerHTML")
data2  = pd.read_html(data)
holding = data2[0]

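Your code is fine, but you missed one thing. The table is paginated by default and shows only 10 records per page, so those are the only records you retrieve. You need to add one extra step (clicking the 'Show More' button) to display all records, so that your DataFrame contains all of them. Here is the refactored code: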

from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Instantiate options
opt = Options()
opt.add_argument("--headless")
opt.add_argument("--disable-gpu")
opt.add_argument("--window-size=1920,1080")

# Instantiate a webdriver (chromedriver is looked up on PATH)
site = 'https://www.ishares.com/us/products/268752/ishares-global-reit-etf'
wd = webdriver.Chrome(options=opt)
wd.maximize_window()
wd.get(site)

# Accept the cookie banner once it becomes clickable
WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
# Scroll the Holdings section into view
wd.execute_script("arguments[0].scrollIntoView();", WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@data-componentname]/h2[normalize-space()='Holdings']"))))
# Click 'Show More' so the table displays every record instead of only the first 10
WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, "(//*[@class='datatables-utilities ui-helper-clearfix']//*[text()='Show More'])[2]"))).click()
# Grab the table's HTML once the table is visible
data = WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@aria-describedby='allHoldingsTable_info']"))).get_attribute("outerHTML")
# Wrap the HTML in StringIO; newer pandas versions deprecate passing a literal HTML string
data2 = pd.read_html(StringIO(data))
holding = data2[0]
print(holding)

# Close the browser when done
wd.quit()

Output:

    Ticker                           Name  ...    SEDOL Accrual Date
0      PLD              PROLOGIS REIT INC  ...  B44WZD7            -
1     EQIX               EQUINIX REIT INC  ...  BVLZX12            -
2      PSA            PUBLIC STORAGE REIT  ...  2852533            -
3      SPG  SIMON PROPERTY GROUP REIT INC  ...  2812452            -
4      DLR  DIGITAL REALTY TRUST REIT INC  ...  B03GQS4            -
..     ...                            ...  ...      ...          ...
379    MYR                        MYR/USD  ...        -            -
380    MYR                        MYR/USD  ...        -            -
381    MYR                        MYR/USD  ...        -            -
382    MYR                        MYR/USD  ...        -            -
383    MYR                        MYR/USD  ...        -            -

[384 rows x 12 columns]
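
If you want to confirm that the 'Show More' click actually expanded the table before handing the HTML to pandas, one option is to count the body rows in the extracted string. The following is only a rough sketch using the standard library; RowCounter, count_body_rows and the sample string are hypothetical names made up for illustration, and it assumes the table has no nested tables.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Count <tr> elements that sit inside a <tbody>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.in_body = True
        elif tag == "tr" and self.in_body:
            self.rows += 1

    def handle_endtag(self, tag):
        if tag == "tbody":
            self.in_body = False


def count_body_rows(table_html):
    """Return the number of data rows in a table's outerHTML string."""
    counter = RowCounter()
    counter.feed(table_html)
    counter.close()
    return counter.rows


# Hypothetical stand-in for the outerHTML string returned by get_attribute()
sample = (
    "<table><thead><tr><th>Ticker</th></tr></thead>"
    "<tbody><tr><td>PLD</td></tr><tr><td>EQIX</td></tr></tbody></table>"
)
print(count_body_rows(sample))  # header row in <thead> is not counted
```

In the script above, count_body_rows(data) returning exactly 10 would suggest that the table is still showing only its first page, so the 'Show More' click probably has not taken effect yet.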
