Python Mechanize HTML code different from Firebug HTML code
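I am using Mechanize to extract some HTML code, but I am running into a problem with the HTML it outputs. Essentially, Mechanize appears to be replacing the content of certain elements with "(n/a)".
Example (structure as shown in Firebug)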
<tr>
<td>
<img class="bullet" src="images/bulletorange.gif" alt="">
<span class="detailCaption">Video Format Mode:</span>
<span class="settingValue" id="vidSdSdiAnlgFormatSelectionMode.1.1">Auto</span>
</td>
</tr>
Example (structure as output by Mechanize)
<tr>
<td>
<img class='bullet' src='images/bulletorange.gif' alt='' />
<span class='detailCaption'>Video Format Mode:</span>
<span class='settingValue' id="vidSdSdiAnlgFormatSelectionMode.1.1">(n/a)</span>
</td>
</tr>
The problem is that "Auto" is replaced with "(n/a)". I am not sure why!
Please help. Why is Mechanize doing this, and how can I fix it?
My code is below:
import ssl

import mechanize


def login_and_return_html(self, url_login, url_after_login, form_username, form_password, username, password):
    """
    Description: Returns HTML code from a website that requires login.
    Input Arguments: url_login (str) - The URL where you enter the login username and password
                     url_after_login (str) - The URL where you want to go after you log in
                     form_username (str) - The name of the form's username input field
                     form_password (str) - The name of the form's password input field
                     username (str) - The actual username
                     password (str) - The actual password
    Return or Output: Returns HTML code of the 'url_after_login' page
    Modules and Classes: mechanize
                         ssl
    """
    try:  # Disable SSL certificate validation
        _create_unverified_https_context = ssl._create_unverified_context
    except AttributeError:  # Legacy Python that doesn't verify HTTPS certificates by default
        pass
    else:  # Handle target environment that doesn't support HTTPS verification
        ssl._create_default_https_context = _create_unverified_https_context

    br = mechanize.Browser()  # Browser
    br.set_handle_equiv(True)  # Browser options
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    cj = mechanize.CookieJar()  # Cookie jar
    br.set_cookiejar(cj)
    # Follow refresh 0, but do not hang on refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

    br.open(url_login)  # Login
    br.select_form(nr=0)
    try:
        br.form[form_username] = username  # Fill in the blank username field
        br.form[form_password] = password  # Fill in the blank password field
        br.submit()
    except Exception:
        # The username field is a dropdown menu rather than a text input
        control = br.form.find_control(form_username)
        for item in control.items:
            if item.name == username:
                item.selected = True
        br.form[form_password] = password  # Fill in the blank password field
        br.submit()

    html = br.open(url_after_login).read()
    return html
Why is mechanize doing this?
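Mechanize probably isn't doing this; the browser is. My guess is that the site uses JavaScript, which Mechanize does not support, so you get the HTML in its original form, that is, as it was before any JavaScript ran.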
And how can I fix it?
Not with Mechanize; you need a solution that supports JavaScript. For details and possible solutions, see "Mechanize and Javascript".
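One way to check this diagnosis is to look at the raw markup the server sends and see whether the placeholder is already there. The sketch below is only an illustration: it assumes Python 3 and its standard-library `html.parser`, the names `SpanTextById` and `span_text` are made up for this example, and `RAW_HTML` is simply the Mechanize output quoted in the question.

```python
from html.parser import HTMLParser

# Static markup as Mechanize returned it in the question (no JavaScript has run).
RAW_HTML = """
<tr>
<td>
<img class='bullet' src='images/bulletorange.gif' alt='' />
<span class='detailCaption'>Video Format Mode:</span>
<span class='settingValue' id="vidSdSdiAnlgFormatSelectionMode.1.1">(n/a)</span>
</td>
</tr>
"""


class SpanTextById(HTMLParser):
    """Collect the text of the first <span> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.capturing = False
        self.text = None  # stays None if no matching span is found

    def handle_starttag(self, tag, attrs):
        if tag == "span" and self.text is None and dict(attrs).get("id") == self.target_id:
            self.capturing = True
            self.text = ""

    def handle_data(self, data):
        if self.capturing:
            self.text += data

    def handle_endtag(self, tag):
        if tag == "span":
            self.capturing = False


def span_text(html, target_id):
    """Return the text inside the span with the given id, or None if absent."""
    parser = SpanTextById(target_id)
    parser.feed(html)
    parser.close()
    return parser.text


print(span_text(RAW_HTML, "vidSdSdiAnlgFormatSelectionMode.1.1"))
```

If the raw source already contains "(n/a)" while Firebug shows "Auto", the value is most likely being filled in by a script after the page loads, which would explain why Mechanize never sees it.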
Here is my solution for getting the HTML together with the content generated by JavaScript.
I used the selenium library.
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time

# Using Firefox 48.0.2 and the new WebDriver
caps = DesiredCapabilities.FIREFOX
caps["marionette"] = True
br = webdriver.Firefox(capabilities=caps)
br.get('http://XXX.XXX.XXX.XXX/')

# Input username and password
username = br.find_element_by_name('SOME_NAME')
username.send_keys('USERNAME')
password = br.find_element_by_name('SOME_NAME')
password.send_keys('PASSWORD')
form = br.find_element_by_name('submitButton')
form.submit()
time.sleep(20)  # Give the page time to run its JavaScript

# THIS IS WHAT IS DIFFERENT...
# Read the DOM as the browser currently holds it, after the JavaScript has run
html_element = br.find_element_by_xpath('/html')
html = br.execute_script("return arguments[0].innerHTML;", html_element)
print(html)
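The fixed `time.sleep(20)` either waits too long or not long enough, depending on how fast the page fills in its values. A possible alternative is to poll until the value is ready. The helper below is a hand-rolled sketch and not part of Selenium; `wait_until` and `read_value` are hypothetical names, and the iterator only stands in for a page whose value changes from "(n/a)" to "Auto". Selenium also ships its own `WebDriverWait` in `selenium.webdriver.support.ui` for this purpose.

```python
import time


def wait_until(predicate, timeout=20.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises RuntimeError once timeout seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise RuntimeError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)


# Stand-in for the live page: the placeholder is shown for the first two polls.
readings = iter(["(n/a)", "(n/a)", "Auto"])


def read_value():
    """Return the current value, or None while the placeholder is still shown."""
    text = next(readings)
    return text if text != "(n/a)" else None


print(wait_until(read_value, timeout=5.0, interval=0.01))
```

With the browser from the code above, the predicate would instead read the element's text, for example by looking up the span by its id and returning the text only when it is no longer "(n/a)". That lookup is an assumption about the page and is not shown in the original solution.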