Scraping a Dynamic Table Using Selenium WebDriverWait Returns a Truncated DataFrame
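I am trying to scrape a dynamic table called "holding" from https://www.ishares.com/us/products/268752/ishares-global-reit-etf
At first I used Selenium, but I got an empty DataFrame. The community here then suggested that I induce a "WebDriverWait" so that the data is fully loaded before I extract it. That worked, but the data I get is truncated from about 400 rows to only 10. How can I get all the data I need? Any help would be appreciated. Thanks.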
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
# Instantiate options
options = webdriver.ChromeOptions()
options.headless = True
# Instantiate a webdriver
site = 'https://www.ishares.com/us/products/268752/ishares-global-reit-etf'
wd = webdriver.Chrome('chromedriver', options=options)
wd.get(site)
# Induce WebDriver Wait
WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
wd.execute_script("arguments[0].scrollIntoView();", WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@data-componentname]/h2[normalize-space()='Holdings']"))))
data = WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@aria-describedby='allHoldingsTable_info']"))).get_attribute("outerHTML")
data2 = pd.read_html(data)
holding = data2[0]
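Your code is fine, but you missed one point. By default the table is paginated and shows only 10 records per page, so those are the only records you retrieve. You need to add one extra step (clicking the 'Show More' button) to display all the records, so that your df contains all of them. Here is the refactored code: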
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import pandas as pd
# Instantiate options
opt = Options()
opt.add_argument("headless")
opt.add_argument("disable-gpu")
opt.add_argument("window-size=1920,1080")
# Instantiate a webdriver
site = 'https://www.ishares.com/us/products/268752/ishares-global-reit-etf'
wd = webdriver.Chrome('chromedriver', options=opt)
wd.maximize_window()
wd.get(site)
# Induce WebDriver Wait
WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
wd.execute_script("arguments[0].scrollIntoView();", WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@data-componentname]/h2[normalize-space()='Holdings']"))))
WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, "(//*[@class='datatables-utilities ui-helper-clearfix']//*[text()='Show More'])[2]"))).click()
data = WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@aria-describedby='allHoldingsTable_info']"))).get_attribute("outerHTML")
data2 = pd.read_html(data)
holding = data2[0]
print(holding)
Output:
     Ticker                           Name  ...    SEDOL  Accrual Date
0       PLD              PROLOGIS REIT INC  ...  B44WZD7             -
1      EQIX               EQUINIX REIT INC  ...  BVLZX12             -
2       PSA            PUBLIC STORAGE REIT  ...  2852533             -
3       SPG  SIMON PROPERTY GROUP REIT INC  ...  2812452             -
4       DLR  DIGITAL REALTY TRUST REIT INC  ...  B03GQS4             -
..      ...                            ...  ...      ...           ...
379     MYR                        MYR/USD  ...        -             -
380     MYR                        MYR/USD  ...        -             -
381     MYR                        MYR/USD  ...        -             -
382     MYR                        MYR/USD  ...        -             -
383     MYR                        MYR/USD  ...        -             -
[384 rows x 12 columns]
Process finished with exit code 0
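A possible refinement, not part of the original answer: after clicking 'Show More', the table may still hold only the first 10 rows for a moment, so reading outerHTML straight away could again give a truncated frame. WebDriverWait.until accepts any callable that takes the driver and keeps polling while it returns a falsy value, so a small custom condition can wait until more rows than one page are present. The class name MoreRowsThan, the threshold of 10, and the assumption that the rows sit under tbody are illustrative and not taken from the page or from Selenium itself.

```python
class MoreRowsThan:
    """Hypothetical custom wait condition for WebDriverWait.until().

    Returns the matching elements once more than `threshold` of them are
    found for `locator`, and False otherwise so the wait keeps polling.
    """

    def __init__(self, locator, threshold):
        self.locator = locator      # a (By.<strategy>, "selector") tuple
        self.threshold = threshold  # e.g. 10, the default page size

    def __call__(self, driver):
        # Same call pattern Selenium's built-in conditions use for locators
        rows = driver.find_elements(*self.locator)
        return rows if len(rows) > self.threshold else False


# Intended use with the driver `wd` from the code above (not executed here;
# the tbody/tr path is an assumption about the table markup):
# row_locator = (By.XPATH, "//table[@aria-describedby='allHoldingsTable_info']/tbody/tr")
# WebDriverWait(wd, 20).until(MoreRowsThan(row_locator, 10))
```

Placed between the 'Show More' click and the get_attribute("outerHTML") call, this would make the script wait for the expanded table instead of relying on timing. If the table can legitimately hold 10 rows or fewer, the wait would time out, so the threshold should match the page being scraped.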