Cannot access the complete webpage using mechanize
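I am trying to save the home page of usautoforce using mechanize. @Ertugrul, following your answer I now get the complete page. But when I try to access the username and password fields, it raises an error. I have set readonly to false on all controls. When I open the saved page in an editor, there is no HTML that refers to the username and password fields.
Here is my mechanize code: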
    import re
    import time

    import mechanize

    br = mechanize.Browser()
    br.set_handle_equiv(True)
    br.set_handle_redirect(True)
    br.set_handle_robots(False)
    #br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    br.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'), ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),('Upgrade-Insecure-Requests','1'),('Connection','keep-alive')]
    br.open("http://www.usautoforce.com/Pages/home.aspx")
    print(br.response())
    time.sleep(9)

    # Rewrite root-relative href/src URLs to absolute ones before saving.
    latest_index = 0
    html_replaced = ""
    html = br.response().read()
    for m in re.finditer('(href|src)(=")(/[^"]+")', html):
        html_replaced += html[latest_index:m.start()] + m.groups()[0] + m.groups()[1] + 'http://www.usautoforce.com' + m.groups()[2]
        latest_index = m.end()
    html_replaced += html[latest_index:]  # keep the text after the last match
    with open("us.html", "w") as f:
        f.write(html_replaced)

    print([form for form in br.forms()][0])
    print(br.response())
    time.sleep(9)
    html = br.response().read()
    br.select_form(nr=0)
    time.sleep(2)
    #for control in br.form.controls:
    #    print control
    #    print "type=%s, name=%s value=%s" % (control.type, control.name, br[control.name])
    br.form.set_all_readonly(False)
    # The next line raises ControlNotFoundError.
    br.form["nexpartuname"] = "abc"
    br.form["pwd"] = "xyz"
    br.submit()
Here is the error:
      File "haha.py", line 60, in <module>
        br.form["nexpartuname"] = "clack"
      File "/usr/lib/python2.7/site-packages/mechanize/_form.py", line 2775, in __setitem__
        control = self.find_control(name)
      File "/usr/lib/python2.7/site-packages/mechanize/_form.py", line 3096, in find_control
        return self._find_control(name, type, kind, id, label, predicate, nr)
      File "/usr/lib/python2.7/site-packages/mechanize/_form.py", line 3180, in _find_control
        raise ControlNotFoundError("no control matching "+description)
    mechanize._form.ControlNotFoundError: no control matching name 'nexpartuname'
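Mechanize does not execute JavaScript. The website you are trying to access also displays 'Please enable scripts...'.
Since there is no way to enable JavaScript in mechanize, I personally suggest you use PhantomJS.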
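To see whether a control is missing because it is only created by JavaScript, it can help to list the control names that are actually present in the static HTML mechanize receives. The following is a minimal sketch using only the standard library, written for Python 3 (on Python 2.7 the parser lives in the `HTMLParser` module). The names `ControlNameCollector` and `control_names`, and the sample markup, are hypothetical and are not taken from the real page.

```python
from html.parser import HTMLParser


class ControlNameCollector(HTMLParser):
    """Collect the name attribute of every form control in static HTML."""

    CONTROL_TAGS = {"input", "select", "textarea", "button"}

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <input ... />.
        if tag in self.CONTROL_TAGS:
            name = dict(attrs).get("name")
            if name:
                self.names.append(name)


def control_names(html):
    """Return the names of all form controls found in the HTML string."""
    collector = ControlNameCollector()
    collector.feed(html)
    return collector.names


# Hypothetical markup standing in for the saved page.
sample = (
    '<form><input type="hidden" name="__VIEWSTATE" value="x">'
    "<noscript>Please enable scripts</noscript></form>"
)
names = control_names(sample)
print(names)
print("nexpartuname" in names)
```

If the name being set (here `nexpartuname`) is not in the list, the field is not part of the markup that mechanize parsed, and `br.form[...]` cannot find it regardless of the readonly setting.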
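But the real problem here is not JavaScript, it is the URLs. Because the URLs in that site are relative, the HTML does not behave as expected when you download it and open it locally.
You have to convert all relative URLs to absolute URLs. Use this code before writing the HTML to the file, and write the html_replaced string to the file instead of the html string.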
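    import re

    latest_index = 0
    html_replaced = ""
    # Prefix every root-relative href/src value with the site origin.
    for m in re.finditer('(href|src)(=")(/[^"]+")', html):
        html_replaced += html[latest_index:m.start()] + m.groups()[0] + m.groups()[1] + 'http://www.usautoforce.com' + m.groups()[2]
        latest_index = m.end()
    # Append whatever follows the last match so the page is not truncated.
    html_replaced += html[latest_index:]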
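The same relative-to-absolute conversion can also be sketched with `urljoin` from the standard library, which resolves page-relative paths such as `logo.png` as well as root-relative ones and leaves already absolute URLs alone. This is only an illustration under a few assumptions: it targets Python 3 (on Python 2.7 `urljoin` is in the `urlparse` module), it only handles double-quoted `href` and `src` attributes, and the function name `absolutize` and the sample markup are hypothetical.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src="..." with double-quoted values only.
ATTR_RE = re.compile(r'(href|src)(\s*=\s*")([^"]*)(")', re.IGNORECASE)


def absolutize(html, base_url):
    """Return html with every href/src value resolved against base_url."""

    def repl(match):
        absolute = urljoin(base_url, match.group(3))
        return match.group(1) + match.group(2) + absolute + match.group(4)

    # re.sub copies the text between and after matches unchanged.
    return ATTR_RE.sub(repl, html)


base = "http://www.usautoforce.com/Pages/home.aspx"
# Hypothetical markup; the paths are made up for the example.
sample = (
    '<a href="/Pages/login.aspx">Login</a>'
    '<img src="logo.png">'
    '<script src="http://cdn.example.com/a.js"></script></body>'
)
print(absolutize(sample, base))
```

Because the substitution is done by `re.sub`, there is no index bookkeeping and nothing after the last link can be dropped.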