Web scraping with Selenium in Python
I am trying to scrape the list of DAOs from messari.io, but I am running into trouble because I get the following errors:
DeprecationWarning: executable_path has been deprecated, please pass in a Service object
driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
DevTools listening on ws://127.0.0.1:56691/devtools/browser/b4609671-5e6e-4d25-b09e-4116b3dde4bf
[0525/100030.252:INFO:CONSOLE(1)] "enabling sentry error tracker", source: https://messari.io/static/js/main.977a4794.chunk.js (1)
[0525/100030.951:INFO:CONSOLE(2)] "Unable to refresh token: Login required", source: https://messari.io/static/js/23.778d04d0.chunk.js (2)
[0525/100031.065:INFO:CONSOLE(2)] "
88b d88 88
888b d888 ""
88'8b d8'88
88 '8b d8' 88 ,adPPYba, ,adPPYba, ,adPPYba, ,adPPYYba, 8b,dPPYba, 88
88 '8b d8' 88 a8P_____88 I8[ "" I8[ "" "" 'Y8 88P' "Y8 88
88 '8b d8' 88 8PP""""""" '"Y8ba, '"Y8ba, ,adPPPPP88 88 88
88 '888' 88 "8b, ,aa aa ]8I aa ]8I 88, ,88 88 88
88 '8' 88 '"Ybbd8"' '"YbbdP"' '"YbbdP"' '"8bbdP"Y8 88 88
", source: https://messari.io/static/js/23.778d04d0.chunk.js (2)
[0525/100031.069:INFO:CONSOLE(2)] "Interested in a CHALLENGE? Check out: https://messari.io/quiz", source: https://messari.io/static/js/23.778d04d0.chunk.js (2)
Traceback (most recent call last):
File "c:/Users/Student/webScrape/scraper.py", line 21, in <module>
matches = WebDriverWait(driver, 10).until(
File "C:\Users\Student\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\support\wait.py", line 89, in until
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
Backtrace:
Ordinal0 [0x0096B8F3+2406643]
Ordinal0 [0x008FAF31+1945393]
Ordinal0 [0x007EC748+837448]
Ordinal0 [0x008192E0+1020640]
Ordinal0 [0x0081957B+1021307]
Ordinal0 [0x00846372+1205106]
Ordinal0 [0x008342C4+1131204]
Ordinal0 [0x00844682+1197698]
Ordinal0 [0x00834096+1130646]
Ordinal0 [0x0080E636+976438]
Ordinal0 [0x0080F546+980294]
GetHandleVerifier [0x00BD9612+2498066]
GetHandleVerifier [0x00BCC920+2445600]
GetHandleVerifier [0x00A04F2A+579370]
GetHandleVerifier [0x00A03D36+574774]
Ordinal0 [0x00901C0B+1973259]
Ordinal0 [0x00906688+1992328]
Ordinal0 [0x00906775+1992565]
Ordinal0 [0x0090F8D1+2029777]
BaseThreadInitThunk [0x777BFA29+25]
RtlGetAppContainerNamedObjectPath [0x77B77A7E+286]
RtlGetAppContainerNamedObjectPath [0x77B77A4E+238]
I know messari.io has an API, but I am almost certain it only covers their assets, not their list of DAOs. I tried Selenium because the page is dynamic, but I still have problems. Here is my code:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests
url = 'https://messari.io/governor/daos'
DRIVER_PATH = 'PATH_TO_DRIVER_ON_MY_PC'
options = Options()
options.headless = True
options.add_argument("--window-size=1920, 1200")
# s = Service('PATH_TO_DRIVER_ON_MY_PC')
driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get('https://messari.io/governor/daos')
try:
    matches = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "td")))
    # for match in matches:
    #     print(match.text)
finally:
    driver.quit()
Update: I fixed the executable_path warning, but I still get the same TimeoutException. When I run it without headless, I also get the following messages:
DevTools listening on ws://127.0.0.1:57773/devtools/browser/4450b78d-3a9f-401a-b39c-2c716ecad924
[9628:20616:0525/102300.840:ERROR:device_event_log_impl.cc(214)] [10:23:00.840] USB: usb_device_handle_win.cc:1049 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)
[9628:20616:0525/102300.841:ERROR:device_event_log_impl.cc(214)] [10:23:00.841] USB: usb_device_handle_win.cc:1049 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)
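Based on similar questions, I assume this part is more of a hardware message that I shouldn't worry about, because one of the two messages went away when I unplugged my mouse.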
In Selenium 4 the executable_path key is deprecated. You have to pass an instance of the Service() class instead, here combined with the ChromeDriverManager().install() command, as shown below:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
options = Options()
options.add_argument("start-maximized")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://www.google.com")
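This page doesn't use <td> to display the list of DAOs, so the wait for a <td> element never succeeds and ends in a TimeoutException. It uses <div> elements (styled with CSS) to make the list look like a table, and it keeps the DAO names in <h4>. At least that is what I see in Firefox on my Linux laptop.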
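If you are not sure which tags a rendered page really contains, a quick count can confirm it before you pick a locator. The following is only a minimal sketch using the standard library. The names TagCounter and count_tags are made up for illustration, and the sample string is hypothetical markup standing in for driver.page_source.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each opening tag appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag the parser encounters.
        self.counts[tag] += 1


def count_tags(html, tags=("td", "div", "h4")):
    """Return {tag: occurrences} for the tags of interest."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return {tag: parser.counts[tag] for tag in tags}


# Hypothetical markup; in a real run you would pass driver.page_source.
sample = "<div><div><h4>Fei</h4></div><div><h4>Rook</h4></div></div>"
print(count_tags(sample))  # {'td': 0, 'div': 3, 'h4': 2}
```

A count of zero for td after the page has rendered would explain why a wait on that tag can only time out.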
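Full working code (tested on Linux Mint, Python 3.8, Selenium 4.x, Chrome 101.x)
I used the module webdriver_manager, so it automatically downloads a new driver when Linux installs a newer version of Chrome.
I had to use find_elements() (with an s in the word elements) or presence_of_all_elements_located() to get all the <h4> elements.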
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
url = 'https://messari.io/governor/daos'
options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")  # width,height without a space
driver = webdriver.Chrome(options=options, service=Service(ChromeDriverManager().install()))
driver.get('https://messari.io/governor/daos')
try:
    matches = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.TAG_NAME, "h4")))
    #matches = driver.find_elements(By.TAG_NAME, "h4")
    for match in matches:
        if match.text:
            print(match.text)
finally:
    driver.quit()
Result:
Fei
Rook
Cosmos
Stargate Finance
Aave
Treasure DAO
DODO
Radicle
Goldfinch
Merit Circle
EPNS
Perpetual Protocol
Gitcoin
SuperRare
Indexed
Doodles
Rome DAO
Badger
Paraswap
Unlock
Terra
Shapeshift
Lobis
Pool Together
The Graph
Yearn Finance
Ampleforth
Alpaca Finance
Balancer
Gro Protocol
Sismo DAO
BeethovenX
ENS
Lido
Alchemist
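EDIT:
To get all the values you may need to scroll the page, because JavaScript adds new items as you scroll.
There are answers that use a while loop and execute_script() to run JavaScript code that scrolls to the bottom and reads the current height. If the height differs from the height before scrolling, you have to scroll again. If the height is the same, you have reached the end of the page and can now get all the items.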
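The scroll-until-the-height-stops-changing idea can be sketched roughly as follows. This is an illustration, not tested against messari.io: the function name scroll_to_bottom, the pause length and the max_rounds safety limit are my own assumptions. It also assumes the page grows document.body, so a page that scrolls inside an inner container would need a different script.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll down until the page height stops growing.

    Returns the number of scrolls performed. `driver` only needs an
    execute_script() method, as a Selenium WebDriver has.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give JavaScript time to append new items
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            return rounds  # height unchanged: end of page reached
        last_height = new_height
    return max_rounds  # safety limit (assumption) to avoid an endless loop
```

You would call scroll_to_bottom(driver) after driver.get(...) and before collecting the <h4> elements. The fixed pause is the simplest option; if the page loads slowly, a longer pause may be needed.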