Apply proxy gateway in Selenium webdriver
My goal is to apply a proxy gateway (e.g. geosurf.io) in Selenium webdriver.
- I need to do this through DesiredCapabilities, since that seems to be the only way to plug in a proxy [gateway] (source).
- DesiredCapabilities works with Selenium Grid (not just a plain Selenium server). Selenium Grid docs.
I have successfully run Selenium Grid on my local Windows 10 machine.
So I wrote the following code to apply the proxy gateway through DesiredCapabilities in Selenium webdriver:
import requests
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
PROXY = "gw1.geosurf.io:8080" # my account at geosurf.io, port 8080 - Germany
from selenium.webdriver.common.proxy import Proxy, ProxyType
proxy_object = Proxy()
proxy_object.proxy_type = ProxyType.MANUAL
proxy_object.http_proxy = PROXY
proxy_object.socks_proxy = PROXY
proxy_object.ssl_proxy = PROXY
keep_alive = True
browser_profile=None
webdriver.DesiredCapabilities.FIREFOX = {
"class":"org.openqa.selenium.Proxy",
"autodetect":False,
"platform": "WIN10"
}
driver = webdriver.Remote("http://192.168.43.98:5566/grid/register", webdriver.DesiredCapabilities.FIREFOX, browser_profile, proxy_object, keep_alive)
When I run the code above, these are the values seen in __init__:
command_executor: http://192.168.43.98:5566/grid/register
capabilities:
{'autodetect': False,
'class': 'org.openqa.selenium.Proxy',
'platform': 'WIN10',
'proxy': {'httpProxy': 'gw1.geosurf.io:8080',
'proxyType': 'MANUAL',
'socksProxy': 'gw1.geosurf.io:8080',
'sslProxy': 'gw1.geosurf.io:8080'}}
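The `proxy` entry in the dump above is what ends up in the capabilities for a manual proxy. As a rough sketch of that shape, here it is built by hand with a plain dict rather than with Selenium's own `Proxy` class. The helper name `manual_proxy_capabilities` is made up for illustration and is not part of Selenium.

```python
def manual_proxy_capabilities(host_port, base=None):
    """Return a copy of `base` with a manual 'proxy' entry added.

    Hypothetical helper: it only mirrors the dict printed above.
    """
    caps = dict(base or {})
    caps["proxy"] = {
        "proxyType": "MANUAL",
        "httpProxy": host_port,
        "sslProxy": host_port,
        "socksProxy": host_port,
    }
    return caps


# Same extra keys as in the script above.
caps = manual_proxy_capabilities(
    "gw1.geosurf.io:8080",
    {"class": "org.openqa.selenium.Proxy", "autodetect": False, "platform": "WIN10"},
)
```

The input dict is copied, so the base capabilities are left unmodified.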
However, the problem shows up in webdriver.py:
Traceback (most recent call last):
File "C:\Users\User\Documents\RnD\captcha-test\test_geosurf_proxy_gateway.py", line 21, in <module>
driver = webdriver.Remote("http://192.168.43.98:5566/grid/register", webdriver.DesiredCapabilities.FIREFOX, browser_profile, proxy_object, keep_alive)
File "C:\Python27\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 99, in __init__
self.start_session(desired_capabilities, browser_profile)
File "C:\Python27\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 191, in start_session
self.session_id = response['sessionId']
TypeError: string indices must be integers
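The error, TypeError: string indices must be integers, does not seem to be related to the proxy gateway type or to the DesiredCapabilities settings.
When printed at line 190, the response variable is a string containing an HTML fragment: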
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<link rel="stylesheet" type="text/css" href="/assets/displayhelpservlet.css" media="all"/>
<link href="/assets/favicon.ico" rel="icon" type="image/x-icon" />
<script src="/assets/jquery-3.1.1.min.js" type="text/javascript"></script>
<script src="/assets/displayhelpservlet.js" type="text/javascript"></script>
<script type="text/javascript">
var json = Object.freeze('{"version":"3.4.0","type":"Grid Node","consoleLink":"/wd/hub"}');
</script>
</head>
<body>
<div id="content">
<div id="help-heading">
<h1><span id="logo"></span></h1>
<h2>Selenium <span class="se-type"></span> v.<span class="se-version"></span></h2>
</div>
<div id="content-body">
<p>
Whoops! The URL specified routes to this help page.
</p>
<p>
For more information about Selenium <span class="se-type"></span> please see the
<a class="se-docs">docs</a> and/or visit the <a class="se-wiki">wiki</a>.
<span id="console-item">
Or perhaps you are looking for the Selenium <span class="se-type"></span> <a class="se-console">console</a>.
</span>
</p>
<p>
Happy Testing!
</p>
</div>
<div>
<footer id="help-footer">
Selenium is made possible through the efforts of our open source community, contributions from
these <a href="https://github.com/SeleniumHQ/selenium/blob/master/AUTHORS">people</a>, and our
<a href="http://www.seleniumhq.org/sponsors/">sponsors</a>.
</footer>
</div>
</div>
</body>
</html>
How can I solve this webdriver.py problem?
Update
While debugging webdriver.py further, I printed the response variable right after response = self.execute(Command.NEW_SESSION, parameters):
{'status': 0,
'value': u'<!DOCTYPE html>\n<html lang="en">\n<head>\n <meta charset="UTF-8">\n <link rel="stylesheet" type="text/css" href="/assets/displayhelpservlet.css" media="all"/>\n <link href="/assets/favicon.ico" rel="icon" type="image/x-icon" />\n <script src="/assets/jquery-3.1.1.min.js" type="text/javascript"></script>\n <script src="/assets/displayhelpservlet.js" type="text/javascript"></script>\n <script type="text/javascript">\n var json = Object.freeze(\'{"version":"3.4.0","type":"Grid Node","consoleLink":"/wd/hub"}\');\n </script>\n</head>\n<body>\n\n<div id="content">\n <div id="help-heading">\n <h1><span id="logo"></span></h1>\n <h2>Selenium <span class="se-type"></span> v.<span class="se-version"></span></h2>\n </div>\n\n <div id="content-body">\n <p>\n Whoops! The URL specified routes to this help page.\n </p>\n <p>\n For more information about Selenium <span class="se-type"></span> please see the\n <a class="se-docs">docs</a> and/or visit the <a class="se-wiki">wiki</a>.\n <span id="console-item">\n Or perhaps you are looking for the Selenium <span class="se-type"></span> <a class="se-console">console</a>.\n </span>\n </p>\n <p>\n Happy Testing!\n </p>\n </div>\n\n <div>\n <footer id="help-footer">\n Selenium is made possible through the efforts of our open source community, contributions from\n these <a href="https://github.com/SeleniumHQ/selenium/blob/master/AUTHORS">people</a>, and our\n <a href="http://www.seleniumhq.org/sponsors/">sponsors</a>.\n </footer>\n </div>\n </div>\n\n</body>\n</html>'}
Why doesn't it contain a sessionId key?
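One way to make this failure easier to read is to check the shape of the response before indexing into it. The sketch below is plain Python and not part of Selenium. `looks_like_new_session` is a hypothetical helper, and it assumes a successful new-session response carries a `sessionId` either at the top level or inside a dict-valued `value`; that layout may differ between Selenium versions.

```python
def looks_like_new_session(response):
    """Return True if `response` carries a session id where a client would look."""
    if not isinstance(response, dict):
        return False
    if response.get("sessionId"):
        return True
    value = response.get("value")
    return isinstance(value, dict) and bool(value.get("sessionId"))


# Abbreviated stand-in for the response printed above: 'value' is HTML text.
html_response = {"status": 0, "value": u"<!DOCTYPE html>\n<html lang=\"en\">"}

# Hypothetical example of a well-formed response, for comparison only.
session_response = {"status": 0, "sessionId": "example-id", "value": {}}

html_ok = looks_like_new_session(html_response)
session_ok = looks_like_new_session(session_response)
```

For the response printed above the check fails, because `value` is the HTML of the help page and not a dict. The page text itself ("The URL specified routes to this help page") suggests the request never reached a WebDriver endpoint.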
Update 2
I had partial success by running
`driver = webdriver.Remote("http://192.168.43.98:5566/wd/hub", webdriver.DesiredCapabilities.FIREFOX, browser_profile, proxy_object, keep_alive)`
as the last line of the script. It produced the following error:
Traceback (most recent call last):
File "C:\Users\User\Documents\RnD\captcha-test\test_geosurf_proxy_gateway.py", line 21, in <module>
driver = webdriver.Remote("http://192.168.43.98:5566/wd/hub", webdriver.DesiredCapabilities.FIREFOX, browser_profile, proxy_object, keep_alive)
File "C:\Python27\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 101, in __init__
self.start_session(desired_capabilities, browser_profile)
File "C:\Python27\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 193, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "C:\Python27\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 265, in execute
self.error_handler.check_response(response)
File "C:\Python27\Lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 194, in check_response
raise exception_class(message, screen, stacktrace)
WebDriverException: Message: The best matching driver provider org.openqa.selenium.ie.InternetExplorerDriver can't create a new driver instance for Capabilities [{proxy={httpProxy=gw1.geosurf.io:8080, proxyType=MANUAL, socksProxy=gw1.geosurf.io:8080, sslProxy=gw1.geosurf.io:8080}, autodetect=false, class=org.openqa.selenium.Proxy, platform=WIN10}]
Build info: version: '3.4.0', revision: 'unknown', time: 'unknown'
System info: host: 'DESKTOP-78JS3VQ', ip: '192.168.43.98', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_131'
Driver info: driver.version: unknown
Stacktrace:
at org.openqa.selenium.remote.server.DefaultDriverFactory.newInstance (DefaultDriverFactory.java:62)
at org.openqa.selenium.remote.server.DefaultSession$BrowserCreator.call (DefaultSession.java:222)
at org.openqa.selenium.remote.server.DefaultSession$BrowserCreator.call (DefaultSession.java:209)
at java.util.concurrent.FutureTask.run (None:-1)
at org.openqa.selenium.remote.server.DefaultSession.run (DefaultSession.java:176)
at java.util.concurrent.ThreadPoolExecutor.runWorker (None:-1)
at java.util.concurrent.ThreadPoolExecutor$Worker.run (None:-1)
at java.lang.Thread.run (None:-1)
Changing the proxy address to localhost:8080 produced the same error...
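Note that the capabilities echoed in the error contain no `browserName`. That is what you would expect after `webdriver.DesiredCapabilities.FIREFOX` is reassigned to a new dict instead of being copied and extended, and it may be why the node falls back to InternetExplorerDriver. Here is a small sketch of the difference using plain dicts. `FIREFOX_DEFAULTS` is a stand-in for the library's predefined dict and is assumed to contain at least a `browserName` key.

```python
# Stand-in for webdriver.DesiredCapabilities.FIREFOX (assumed contents).
FIREFOX_DEFAULTS = {"browserName": "firefox"}

extra = {
    "class": "org.openqa.selenium.Proxy",
    "autodetect": False,
    "platform": "WIN10",
}

# Reassigning replaces the defaults entirely, so browserName is lost.
overwritten = dict(extra)

# Copying and updating keeps the defaults and adds the extra keys.
extended = FIREFOX_DEFAULTS.copy()
extended.update(extra)
```

Working on a copy also leaves the shared defaults untouched for any later sessions in the same process.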
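Update 3
I managed to open the node console manually in a browser at http://192.168.43.98:5566/wd/hub/static/resource/hub.html
However, the only session I could load was a Chrome browser.
I was not able to load Firefox or IE 10 browser sessions for this grid.
I'm not sure whether managing the Grid node this way helps with plugging in an external proxy.
Eventually I was able to open a Chrome browser instance with the following code: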
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
PROXY = "gw1.geosurf.io:8080" # 8080 - Germany
from selenium.webdriver.common.proxy import Proxy, ProxyType
proxy_object = Proxy()
proxy_object.proxy_type = ProxyType.MANUAL
proxy_object.http_proxy = PROXY
proxy_object.socks_proxy = PROXY
proxy_object.ssl_proxy = PROXY
keep_alive = True
browser_profile=None
capabilities = webdriver.DesiredCapabilities.CHROME.copy()
capabilities['class'] = "org.openqa.selenium.Proxy"
capabilities['platform'] = "WINDOWS"
capabilities['version'] = "10"
capabilities["autodetect"]= False
driver = webdriver.Remote("http://192.168.43.98:5566/wd/hub", capabilities, browser_profile, proxy_object, keep_alive)
driver.get('http://testing-ground.scraping.pro/recaptcha')
raw_input('Press any key to quit Selenium driver: ')
driver.quit()
However, the opened browser instance fails to load anything...