Scraping a non-interactable table from a dynamic webpage

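I have seen a few posts with the same issue, but their scripts usually wait until one of the elements (a button) is clickable. Here is the table I'm trying to scrape: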

https://ropercenter.cornell.edu/presidential-approval/highslows

On my first few attempts, my code returned all the rows except the two Polling Organization columns. Without any changes, it now only scrapes the table headers and the tbody tag (no table rows).

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://ropercenter.cornell.edu/presidential-approval/highslows"
driver = webdriver.Firefox()
driver.get(url)

driver.implicitly_wait(12)
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
approvalData = pd.read_html(str(table[0]))
approvalData = pd.DataFrame(approvalData[0], columns=['President', 'Highest %', 'Polling Organization & Dates H', 'Lowest %', 'Polling Organization & Dates L'])

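Should I use an explicit wait? If so, which condition should I wait for, given that the dynamic table is not interactive?

Also, why does the output of my code change after running it multiple times?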

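A simpler solution, though perhaps more of a cheat, solves your problem in a different way: look at what the front end does (using the browser's developer tools) and you will find that it calls an API that returns JSON, so there is no need for Selenium. requests and pandas are enough.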

import requests
import pandas as pd

url = "https://ropercenter.cornell.edu/presidential-approval/api/presidents/highlow"

data = requests.get(url).json()
df = pd.json_normalize(data)
>>> df
                            president.id  president.active president.surname president.givenname president.shortname  ... low.approve  low.disapprove low.noOpinion low.sampleSize      low.presidentName
0   e9c0d19b-dfe9-49cf-9939-d06a0f256e57              True             Biden                 Joe                None  ...          33              53            13         1313.0              Joe Biden
1   bc9855d5-8e97-4448-b62e-1fb2865c79e6              True             Trump              Donald                None  ...          29              68             3         5360.0           Donald Trump
2   1c49881f-0f0c-4a53-9b2c-0dd6540f88e4              True             Obama              Barack                None  ...          37              57             5         1017.0           Barack Obama
3   ceda6415-5975-404d-8049-978758a7d1f8              True              Bush           George W.             W. Bush  ...          19              77             4         1100.0         George W. Bush
4   4f7344de-a7bd-4bc6-9147-87963ae51095              True           Clinton                Bill                None  ...          36              50            14          800.0           Bill Clinton
5   116721f1-f947-4c14-b0b5-d521ed5a4c8b              True              Bush         George H.W.           H.W. Bush  ...          29              60            11         1001.0       George H.W. Bush
6   43720f8f-0b9f-43b0-8c0d-63da059e7a57              True            Reagan              Ronald                None  ...          35              56             9         1555.0          Ronald Reagan
7   7aa76fd3-e1bc-4e9a-b13c-463a64e0c864              True            Carter               Jimmy                None  ...          28              59            13         1542.0           Jimmy Carter
8   6255dd77-531d-46c6-bb26-627e2a4b3654              True              Ford              Gerald                None  ...          37              39            24         1519.0            Gerald Ford
9   f1a23b06-4200-41e6-b137-dd46260ac4d8              True             Nixon             Richard                None  ...          23              55            22         1589.0          Richard Nixon
10  772aabfd-289b-4f10-aaae-81a82dd3dbc6              True           Johnson           Lyndon B.                None  ...          35              52            13         1526.0      Lyndon B. Johnson
11  d849b5a8-f711-4ac9-9728-c3915e17bb6a              True           Kennedy             John F.                None  ...          56              30            14         1550.0        John F. Kennedy
12  e22fd64a-cf20-4bc4-8db6-b4e71dc4483d              True        Eisenhower           Dwight D.                None  ...          48              36            16            NaN   Dwight D. Eisenhower
13  ab0bfa04-61da-49d1-8069-6992f6124f17              True            Truman            Harry S.                None  ...          22              65            13            NaN        Harry S. Truman
14  11edf04f-9d8d-4678-976d-b9339b46705d              True         Roosevelt         Franklin D.                None  ...          48              43             8            NaN  Franklin D. Roosevelt

[15 rows x 41 columns]
>>> df.columns
Index(['president.id', 'president.active', 'president.surname',
       'president.givenname', 'president.shortname', 'president.fullname',
       'president.number', 'president.terms', 'president.ratings',
       'president.termCount', 'president.ratingCount', 'high.id',
       'high.active', 'high.organization.id', 'high.organization.active',
       'high.organization.name', 'high.organization.ratingCount',
       'high.pollingStart', 'high.pollingEnd', 'high.updated',
       'high.president', 'high.approve', 'high.disapprove', 'high.noOpinion',
       'high.sampleSize', 'high.presidentName', 'low.id', 'low.active',
       'low.organization.id', 'low.organization.active',
       'low.organization.name', 'low.organization.ratingCount',
       'low.pollingStart', 'low.pollingEnd', 'low.updated', 'low.president',
       'low.approve', 'low.disapprove', 'low.noOpinion', 'low.sampleSize',
       'low.presidentName'],
      dtype='object')
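
As a follow-up, here is a minimal sketch of how the 41 flattened columns might be reduced to the five that the web page displays. The column names on the left come from the `df.columns` output above; the display labels, the `summarize` helper, and the sample record are all hypothetical, and the sample values are made up purely so the snippet runs without a network call.

```python
import pandas as pd

# Flattened column names (as listed in df.columns) mapped to display labels.
# The labels on the right are illustrative choices, not names used by the site.
COLUMNS = {
    "president.fullname": "President",
    "high.approve": "Highest %",
    "high.organization.name": "Highest: Polling Organization",
    "low.approve": "Lowest %",
    "low.organization.name": "Lowest: Polling Organization",
}


def summarize(records):
    """Flatten nested JSON records and keep only the columns of interest."""
    df = pd.json_normalize(records)
    return df[list(COLUMNS)].rename(columns=COLUMNS)


# Made-up record that only mimics the nesting implied by the column names.
sample = [
    {
        "president": {"fullname": "Example President"},
        "high": {"approve": 60, "organization": {"name": "Example Org A"}},
        "low": {"approve": 40, "organization": {"name": "Example Org B"}},
    }
]

print(summarize(sample))
```

With the real response, `summarize(requests.get(url).json())` should give one row per president, assuming the API keeps returning the fields shown above.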

Using only Selenium and GeckoDriver, to extract the table contents from the website you need to induce WebDriverWait for visibility_of_element_located() and then parse the table's outerHTML with read_html() from Pandas. You can use the following:

  • Code Block:

    from io import StringIO

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.service import Service
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    import pandas as pd
    
    options = Options()
    options.add_argument('--disable-blink-features=AutomationControlled')
    # Raw string so the backslashes in the Windows path are not read as escapes
    s = Service(r'C:\BrowserDrivers\geckodriver.exe')
    driver = webdriver.Firefox(service=s, options=options)
    driver.get('https://ropercenter.cornell.edu/presidential-approval/highslows')
    # Wait up to 20 seconds for the table to become visible, then grab its HTML
    tabledata = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='table table-striped']"))).get_attribute("outerHTML")
    # Wrap the HTML in StringIO; recent pandas versions deprecate literal HTML strings
    tabledf = pd.read_html(StringIO(tabledata))
    print(tabledf)
    driver.quit()
    
  • Console Output:

    [                President Highest %  ... Lowest %                     Polling Organization & Dates.1
    0               Joe Biden       63%  ...      33%  Quinnipiac UniversityJan 7th, 2022 - Jan 10th,...
    1            Donald Trump       49%  ...      29%                  PewJan 8th, 2021 - Jan 12th, 2021
    2            Barack Obama       76%  ...      37%  Gallup OrganizationSep 8th, 2011 - Sep 11th, 2011
    3          George W. Bush       92%  ...      19%  American Research GroupFeb 16th, 2008 - Feb 19...
    4            Bill Clinton       73%  ...      36%  Yankelovich Partners / TIME / CNNMay 26th, 199...
    5        George H.W. Bush       89%  ...      29%  Gallup OrganizationJul 31st, 1992 - Aug 2nd, 1992
    6           Ronald Reagan       68%  ...      35%  Gallup OrganizationJan 28th, 1983 - Jan 31st, ...
    7            Jimmy Carter       75%  ...      28%  Gallup OrganizationJun 29th, 1979 - Jul 2nd, 1979
    8             Gerald Ford       71%  ...      37%  Gallup OrganizationJan 10th, 1975 - Jan 13th, ...
    9           Richard Nixon       70%  ...      23%   Gallup OrganizationJan 4th, 1974 - Jan 7th, 1974
    10      Lyndon B. Johnson       80%  ...      35%  Gallup OrganizationAug 7th, 1968 - Aug 12th, 1968
    11        John F. Kennedy       83%  ...      56%  Gallup OrganizationSep 12th, 1963 - Sep 17th, ...
    12   Dwight D. Eisenhower       78%  ...      48%  Gallup OrganizationMar 27th, 1958 - Apr 1st, 1958
    13        Harry S. Truman       87%  ...      22%  Gallup OrganizationFeb 9th, 1952 - Feb 14th, 1952
    14  Franklin D. Roosevelt       84%  ...      48%  Gallup OrganizationAug 18th, 1939 - Aug 24th, ...
    
    [15 rows x 5 columns]]