Scraping a non-interactable table from a dynamic webpage
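I've seen several posts with the same problem, but their scripts usually wait until one of the elements (a button) is clickable. Here is the table I'm trying to scrape:
https://ropercenter.cornell.edu/presidential-approval/highslows
On my first few attempts, my code returned all the rows but not the two Polling Organization columns. Without any changes on my part, it now scrapes only the table headers and the tbody tag (no table rows).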
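from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

url = "https://ropercenter.cornell.edu/presidential-approval/highslows"
driver = webdriver.Firefox()
driver.get(url)
# implicitly_wait only affects later find_element calls, not page_source
driver.implicitly_wait(12)
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
approvalData = pd.read_html(str(table[0]))
# columns= selects by label, so names missing from the parsed table come back as NaN
approvalData = pd.DataFrame(approvalData[0], columns = ['President', 'Highest %', 'Polling Organization & Dates H', 'Lowest %', 'Polling Organization & Dates L'])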
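Should I use an explicit wait? If so, which condition should I wait for, given that the dynamic table is not interactive?
Also, why does the output of my code change after running it several times?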
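This is perhaps more of a cheat, but it is a simpler solution that solves your problem in a different way: look at what the front end does (using the browser's developer tools) and you will find that it calls an API that returns JSON, so Selenium isn't needed. requests and pandas are enough.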
import requests
import pandas as pd
url = "https://ropercenter.cornell.edu/presidential-approval/api/presidents/highlow"
data = requests.get(url).json()
df = pd.json_normalize(data)
>>> df
president.id president.active president.surname president.givenname president.shortname ... low.approve low.disapprove low.noOpinion low.sampleSize low.presidentName
0 e9c0d19b-dfe9-49cf-9939-d06a0f256e57 True Biden Joe None ... 33 53 13 1313.0 Joe Biden
1 bc9855d5-8e97-4448-b62e-1fb2865c79e6 True Trump Donald None ... 29 68 3 5360.0 Donald Trump
2 1c49881f-0f0c-4a53-9b2c-0dd6540f88e4 True Obama Barack None ... 37 57 5 1017.0 Barack Obama
3 ceda6415-5975-404d-8049-978758a7d1f8 True Bush George W. W. Bush ... 19 77 4 1100.0 George W. Bush
4 4f7344de-a7bd-4bc6-9147-87963ae51095 True Clinton Bill None ... 36 50 14 800.0 Bill Clinton
5 116721f1-f947-4c14-b0b5-d521ed5a4c8b True Bush George H.W. H.W. Bush ... 29 60 11 1001.0 George H.W. Bush
6 43720f8f-0b9f-43b0-8c0d-63da059e7a57 True Reagan Ronald None ... 35 56 9 1555.0 Ronald Reagan
7 7aa76fd3-e1bc-4e9a-b13c-463a64e0c864 True Carter Jimmy None ... 28 59 13 1542.0 Jimmy Carter
8 6255dd77-531d-46c6-bb26-627e2a4b3654 True Ford Gerald None ... 37 39 24 1519.0 Gerald Ford
9 f1a23b06-4200-41e6-b137-dd46260ac4d8 True Nixon Richard None ... 23 55 22 1589.0 Richard Nixon
10 772aabfd-289b-4f10-aaae-81a82dd3dbc6 True Johnson Lyndon B. None ... 35 52 13 1526.0 Lyndon B. Johnson
11 d849b5a8-f711-4ac9-9728-c3915e17bb6a True Kennedy John F. None ... 56 30 14 1550.0 John F. Kennedy
12 e22fd64a-cf20-4bc4-8db6-b4e71dc4483d True Eisenhower Dwight D. None ... 48 36 16 NaN Dwight D. Eisenhower
13 ab0bfa04-61da-49d1-8069-6992f6124f17 True Truman Harry S. None ... 22 65 13 NaN Harry S. Truman
14 11edf04f-9d8d-4678-976d-b9339b46705d True Roosevelt Franklin D. None ... 48 43 8 NaN Franklin D. Roosevelt
[15 rows x 41 columns]
>>> df.columns
Index(['president.id', 'president.active', 'president.surname',
'president.givenname', 'president.shortname', 'president.fullname',
'president.number', 'president.terms', 'president.ratings',
'president.termCount', 'president.ratingCount', 'high.id',
'high.active', 'high.organization.id', 'high.organization.active',
'high.organization.name', 'high.organization.ratingCount',
'high.pollingStart', 'high.pollingEnd', 'high.updated',
'high.president', 'high.approve', 'high.disapprove', 'high.noOpinion',
'high.sampleSize', 'high.presidentName', 'low.id', 'low.active',
'low.organization.id', 'low.organization.active',
'low.organization.name', 'low.organization.ratingCount',
'low.pollingStart', 'low.pollingEnd', 'low.updated', 'low.president',
'low.approve', 'low.disapprove', 'low.noOpinion', 'low.sampleSize',
'low.presidentName'],
dtype='object')
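To get from those 41 flattened columns to something shaped like the table on the page, one option is to select a handful of them and rename them. The following is only a sketch: the source column names are taken from the df.columns listing above, while the display headers, the `summarize` helper, and the sample record are made up for illustration (the sample only mimics the nesting and is not real API data). It also assumes the API response is a list of nested objects, which matches the 15-row result shown above.

```python
import pandas as pd

# Flattened JSON column -> display header.
# Keys come from the df.columns listing; the headers are illustrative choices.
COLUMN_MAP = {
    "high.presidentName": "President",
    "high.approve": "Highest %",
    "high.organization.name": "Polling Organization (High)",
    "low.approve": "Lowest %",
    "low.organization.name": "Polling Organization (Low)",
}


def summarize(records):
    """Flatten nested records and keep only the table-like columns."""
    flat = pd.json_normalize(records)
    return flat[list(COLUMN_MAP)].rename(columns=COLUMN_MAP)


# Made-up record that only mimics the nesting of the API response.
sample = [{
    "high": {"presidentName": "Example President", "approve": 70,
             "organization": {"name": "Example Poll A"}},
    "low": {"presidentName": "Example President", "approve": 30,
            "organization": {"name": "Example Poll B"}},
}]

summary = summarize(sample)
print(summary)
```

With the real response, `summarize(data)` should work the same way as long as those five keys are present. The polling dates live in separate columns (`high.pollingStart`, `high.pollingEnd`, and the `low.` equivalents) and could be added to the mapping in the same manner.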
Using only Selenium, GeckoDriver, and Firefox, you can extract the table contents by inducing WebDriverWait for visibility_of_element_located() and then passing the table's outerHTML to pandas read_html():
Code block:
from io import StringIO

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

options = Options()
options.add_argument('--disable-blink-features=AutomationControlled')
# Raw string so the backslashes in the Windows path are taken literally
s = Service(r'C:\BrowserDrivers\geckodriver.exe')
driver = webdriver.Firefox(service=s, options=options)
driver.get('https://ropercenter.cornell.edu/presidential-approval/highslows')
# Block for up to 20 seconds until the table is visible, then grab its HTML
tabledata = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='table table-striped']"))).get_attribute("outerHTML")
# Recent pandas versions expect a file-like object rather than a raw HTML string
tabledf = pd.read_html(StringIO(tabledata))
print(tabledf)
driver.quit()
Console output:
[ President Highest % ... Lowest % Polling Organization & Dates.1
0 Joe Biden 63% ... 33% Quinnipiac UniversityJan 7th, 2022 - Jan 10th,...
1 Donald Trump 49% ... 29% PewJan 8th, 2021 - Jan 12th, 2021
2 Barack Obama 76% ... 37% Gallup OrganizationSep 8th, 2011 - Sep 11th, 2011
3 George W. Bush 92% ... 19% American Research GroupFeb 16th, 2008 - Feb 19...
4 Bill Clinton 73% ... 36% Yankelovich Partners / TIME / CNNMay 26th, 199...
5 George H.W. Bush 89% ... 29% Gallup OrganizationJul 31st, 1992 - Aug 2nd, 1992
6 Ronald Reagan 68% ... 35% Gallup OrganizationJan 28th, 1983 - Jan 31st, ...
7 Jimmy Carter 75% ... 28% Gallup OrganizationJun 29th, 1979 - Jul 2nd, 1979
8 Gerald Ford 71% ... 37% Gallup OrganizationJan 10th, 1975 - Jan 13th, ...
9 Richard Nixon 70% ... 23% Gallup OrganizationJan 4th, 1974 - Jan 7th, 1974
10 Lyndon B. Johnson 80% ... 35% Gallup OrganizationAug 7th, 1968 - Aug 12th, 1968
11 John F. Kennedy 83% ... 56% Gallup OrganizationSep 12th, 1963 - Sep 17th, ...
12 Dwight D. Eisenhower 78% ... 48% Gallup OrganizationMar 27th, 1958 - Apr 1st, 1958
13 Harry S. Truman 87% ... 22% Gallup OrganizationFeb 9th, 1952 - Feb 14th, 1952
14 Franklin D. Roosevelt 84% ... 48% Gallup OrganizationAug 18th, 1939 - Aug 24th, ...
[15 rows x 5 columns]]
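On which condition to wait for when nothing on the page is clickable: the question reports getting the headers and an empty tbody, which suggests the table shell can be in the DOM before its rows are filled in. That would also explain why the output changes between runs, since implicitly_wait only applies to element lookups and does not delay page_source. A possible refinement is to wait for the rows themselves. The sketch below is not tested against the live page. `rows_present` and `wait_for_table_html` are hypothetical helper names, and the `table tbody tr` selector is an assumption about the page's markup.

```python
def rows_present(locator, min_rows=1):
    """Build a wait condition that passes once enough elements match `locator`."""
    def _condition(driver):
        rows = driver.find_elements(*locator)
        # Returning False tells WebDriverWait to keep polling.
        return rows if len(rows) >= min_rows else False
    return _condition


def wait_for_table_html(url, timeout=20):
    """Open `url` in Firefox and return the table's outerHTML once rows exist."""
    # Selenium is imported here so rows_present can be tried without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Wait for data rows, not just the <table> shell (selector is assumed).
        WebDriverWait(driver, timeout).until(
            rows_present((By.CSS_SELECTOR, "table tbody tr"))
        )
        table = driver.find_element(By.CSS_SELECTOR, "table")
        return table.get_attribute("outerHTML")
    finally:
        driver.quit()
```

WebDriverWait.until accepts any callable that takes the driver and returns a truthy value when done, so the custom condition plugs in the same way the built-in expected_conditions do. The returned HTML can then be handed to pd.read_html, as in the code block above.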