Fetching the status codes of multiple web URLs using python-requests, urllib3, or the selenium module

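I am trying to write a Python script that fetches the HTTP status code and response for about 200 URLs. The final output should be an HTML report showing, for each URL, its name, status code, response message, error (if any), and a screenshot of the page. I tried to develop this script with the requests and urllib modules, but whenever an HTTPException occurs my code breaks without capturing the status code and response message for that particular URL. As an alternative, I wrote another Python script with the selenium module, in which I am trying to capture the performance logs for the URL, specifically "Network.responseReceived".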

from selenium import webdriver
from datetime import datetime
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# enable browser logging
# Note: this mutates the shared DesiredCapabilities.CHROME dict in place.
d = DesiredCapabilities.CHROME
d['loggingPrefs'] = {'performance': 'ALL'}
options = webdriver.ChromeOptions()
options.add_argument("--headless")
# Raw strings keep the backslashes in Windows paths from being read as escape sequences.
driver = webdriver.Chrome(chrome_options=options, executable_path=r"C:\chromedriver_win32\chromedriver.exe")
#driver = webdriver.Ie(executable_path=r"C:\IE_driver\MicrosoftWebDriver.exe")
driver.get("https://www.google.com")
#driver.get('https://www.google.com/nonexistant')

print(driver.title)
performance_log = driver.get_log('performance')

# Each entry is a dict whose 'message' value is a JSON-encoded string.
for entry in performance_log:
    print(type(entry))
    print(entry)
    print("================================================")
    print(" ")
    print(" ")

driver.close()

Below is the output I get.

Google
<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Network.loadingFinished","params":{"encodedDataLength":0,"requestId":"D99D380DD024B8928B5EAAC76E447956","shouldReportCorbBlocking":false,"timestamp":528401.402473}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297228343}
================================================


<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Page.frameNavigated","params":{"frame":{"id":"8DBAE0AE8594201DC3D129C819A696C8","loaderId":"D99D380DD024B8928B5EAAC76E447956","mimeType":"text/plain","securityOrigin":"://","url":"data:,"}}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297228343}
================================================


<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Page.loadEventFired","params":{"timestamp":528401.409908}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297228344}
================================================


<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Page.frameStoppedLoading","params":{"frameId":"8DBAE0AE8594201DC3D129C819A696C8"}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297228346}
================================================


<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Page.domContentEventFired","params":{"timestamp":528401.41067}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297228347}
================================================


<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Network.requestWillBeSent","params":{"documentURL":"https://www.google.com/","frameId":"8DBAE0AE8594201DC3D129C819A696C8","hasUserGesture":false,"initiator":{"type":"other"},"loaderId":"16D0090B144D4D0D6DB68B993CE5DE12","request":{"headers":{"Upgrade-Insecure-Requests":"1","User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/72.0.3626.109 Safari/537.36"},"initialPriority":"VeryHigh","method":"GET","mixedContentType":"none","referrerPolicy":"no-referrer-when-downgrade","url":"https://www.google.com/"},"requestId":"16D0090B144D4D0D6DB68B993CE5DE12","timestamp":528401.455107,"type":"Document","wallTime":1554297228.37452}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297228378}
================================================


<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Network.responseReceived","params":{"frameId":"8DBAE0AE8594201DC3D129C819A696C8","loaderId":"16D0090B144D4D0D6DB68B993CE5DE12","requestId":"16D0090B144D4D0D6DB68B993CE5DE12","response":{"connectionId":17,"connectionReused":false,"encodedDataLength":6681,"fromDiskCache":false,"fromServiceWorker":false,"headers":{"alt-svc":"quic=\":443\"; ma=2592000; v=\"46,44,43,39\"","cache-control":"private, max-age=0","content-encoding":"gzip","content-length":"65219","content-type":"text/html; charset=UTF-8","date":"Wed, 03 Apr 2019 13:13:52 GMT","expires":"-1","p3p":"CP=\"This is not a P3P policy! See g.co/p3phelp for more info.\"","server":"gws","set-cookie":"1P_JAR=2019-04-03-13; expires=Fri, 03-May-2019 13:13:52 GMT; path=/; domain=.google.com\nNID=180=fV81eC5C8adCVzltTPlJnIxiDUi4bSEzqRVHIQwx7z5S75opd6k3fmtLeGNOllEqRlpcQ-X31RSveq0FgdL5e0GBcVZxYZjzI9g2Bgn_Wepj5RfErPoo5re54HFO-sgiXV5vqNftY7JHm60YxVYQXJqp9HhpdbpB0cJ3HLOCguo; expires=Thu, 03-Oct-2019 13:13:52 GMT; path=/; domain=.google.com; HttpOnly","status":"200","x-frame-options":"SAMEORIGIN","x-xss-protection":"0"},"mimeType":"text/html","protocol":"h2","remoteIPAddress":"172.217.168.196","remotePort":443,"requestHeaders":{":authority":"www.google.com",":method":"GET",":path":"/",":scheme":"https","accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8","accept-encoding":"gzip, deflate, br","upgrade-insecure-requests":"1","user-agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/72.0.3626.109 Safari/537.36"},"securityDetails":{"certificateId":0,"certificateTransparencyCompliance":"unknown","cipher":"AES_128_GCM","issuer":"Google Internet Authority G3","keyExchange":"","keyExchangeGroup":"X25519","protocol":"TLS 1.3","sanList":["www.google.com"],"signedCertificateTimestampList":[],"subjectName":"www.google.com","validFrom":1551433595,"validTo":1558689900},"securityState":"secure","status":200,"statusText":"","timing":{"connectEnd":3683.223,"connectStart":2467.054,"dnsEnd":2467.054,"dnsStart":2352.226,"proxyEnd":2351.998,"proxyStart":86.464,"pushEnd":0,"pushStart":0,"receiveHeadersEnd":3976.231,"requestTime":528401.456284,"sendEnd":3687.307,"sendStart":3685.241,"sslEnd":3683.104,"sslStart":2620.349,"workerReady":-1,"workerStart":-1},"url":"https://www.google.com/"},"timestamp":528405.434789,"type":"Document"}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297232388}
================================================



I need to parse the Network.responseReceived details, since they contain everything I need. How should I parse the details out of the Network.responseReceived log entry?

Convert the "message" value of each entry into a Python dictionary and extract the attributes you need.

At the beginning of the script, add an import of the json library (import json); then, in the loop over performance_log:

for entry in performance_log:
    message = json.loads(entry['message'])

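Now the variable message is an ordinary Python dictionary from which you can read any attribute you need. Only entries whose method is Network.responseReceived contain a response object, so check message['message']['method'] before indexing into it. For example, this is the status code: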

print(message['message']['params']['response']['status'])

And this is the target URL:

print(message['message']['params']['response']['url'])
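
Putting the pieces above together, here is a minimal sketch that runs without a browser. The sample_log list is a heavily trimmed copy of two entries from the question's output (most fields dropped for brevity), and extract_responses is a name made up for this illustration, not part of selenium. It skips every entry that is not a Network.responseReceived event, because those entries have no response object.

```python
import json

# Trimmed sample in the same shape as the driver.get_log('performance') output
# shown in the question; most fields are omitted for brevity.
sample_log = [
    {
        "level": "INFO",
        "message": json.dumps({
            "message": {
                "method": "Page.loadEventFired",
                "params": {"timestamp": 528401.409908},
            },
            "webview": "8DBAE0AE8594201DC3D129C819A696C8",
        }),
        "timestamp": 1554297228344,
    },
    {
        "level": "INFO",
        "message": json.dumps({
            "message": {
                "method": "Network.responseReceived",
                "params": {
                    "response": {
                        "status": 200,
                        "statusText": "",
                        "url": "https://www.google.com/",
                    },
                    "type": "Document",
                },
            },
            "webview": "8DBAE0AE8594201DC3D129C819A696C8",
        }),
        "timestamp": 1554297232388,
    },
]


def extract_responses(performance_log):
    """Return (url, status) for every Network.responseReceived entry."""
    results = []
    for entry in performance_log:
        # The 'message' value is a JSON string; the event sits under its 'message' key.
        message = json.loads(entry["message"])["message"]
        if message.get("method") != "Network.responseReceived":
            continue  # other events carry no 'response' object
        response = message["params"]["response"]
        results.append((response["url"], response["status"]))
    return results


print(extract_responses(sample_log))  # [('https://www.google.com/', 200)]
```

In the real script you would pass driver.get_log('performance') in place of sample_log.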

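Keep in mind that you will get an entry for every resource request the browser (or the HTML) makes, so you may want to filter down to only the top-level request for the page itself.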
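
One possible way to do that filtering, as a sketch rather than a definitive rule: the Network.responseReceived entry in the question's output has "type":"Document" in its params, so you could keep only entries of that type. The names document_responses, make_entry, and demo_log are hypothetical and exist only for this example, and the example.com entries are invented test data. Note that a page with iframes may yield more than one Document response, so this assumes you then pick the one whose URL matches the page you requested.

```python
import json


def document_responses(performance_log):
    """Return (url, status) for responses whose resource type is 'Document'."""
    results = []
    for entry in performance_log:
        message = json.loads(entry["message"])["message"]
        if message.get("method") != "Network.responseReceived":
            continue
        params = message["params"]
        if params.get("type") != "Document":
            continue  # skip images, scripts, stylesheets, XHR, and so on
        response = params["response"]
        results.append((response["url"], response["status"]))
    return results


def make_entry(method, params):
    """Hypothetical helper: build an entry shaped like the ones in the question."""
    payload = {"message": {"method": method, "params": params}}
    return {"level": "INFO", "message": json.dumps(payload)}


# Invented demo data: one page response and one sub-resource response.
demo_log = [
    make_entry("Page.loadEventFired", {"timestamp": 1.0}),
    make_entry(
        "Network.responseReceived",
        {"type": "Document",
         "response": {"url": "https://example.com/missing", "status": 404}},
    ),
    make_entry(
        "Network.responseReceived",
        {"type": "Image",
         "response": {"url": "https://example.com/logo.png", "status": 200}},
    ),
]

print(document_responses(demo_log))  # [('https://example.com/missing', 404)]
```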