Python mechanize navigation using __doPostBack functions
How can I use mechanize to navigate through a table on a web page when the table's paging uses __doPostBack functions?

My code is:
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open("http://www.gfsc.gg/The-Commission/Pages/Regulated-Entities.aspx?auto_click=1")

page_num = 2
for link in br.links():
    if link.text == str(page_num):
        br.open(link)  # I suspect this is not correct
        break
for link in br.links():
    print link.text, link.url
Searching all the controls on the page (e.g. the drop-down menus) does not show the page buttons, but searching all the links in the table does. A page button does not contain a URL, so it is not a typical link, and I get `TypeError: expected string or buffer`.

My impression is that this can be done with mechanize.

Thanks for reading.
Mechanize can be used to navigate a table that uses __doPostBack. I used BeautifulSoup to parse the HTML for the required parameters. My code is written below.
import mechanize
import re  # regex to extract the parameters expected by __doPostBack
from bs4 import BeautifulSoup
from time import sleep

br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
response = br.open("http://www.gfsc.gg/The-Commission/Pages/Regulated-Entities.aspx?auto_click=1")

# satisfy the __doPostBack function to navigate to different pages
for pg in range(2, 5):
    br.select_form(nr=0)        # the only form on the page
    br.set_all_readonly(False)  # allow setting the __doPostBack parameters

    # BeautifulSoup for parsing
    soup = BeautifulSoup(response, 'lxml')
    table = soup.find('table', {'class': 'RegulatedEntities'})
    records = table.find_all('tr', {'style': ["background-color:#E4E3E3;border-style:None;", "border-style:None;"]})
    for rec in records[:1]:
        print 'Company name:', rec.a.string

    # disable 'Search' and 'Clear filters' so they are not submitted
    for control in br.form.controls[:]:
        if control.type in ['submit', 'image', 'checkbox']:
            control.disabled = True

    # get the parameters for the __doPostBack function on the current page button
    for link in soup("a"):
        if link.string == str(pg):
            match = re.search(r"""<a href="javascript:__doPostBack\('(.*?)','(.*?)'\)">""", str(link))
            br["__EVENTTARGET"] = match.group(1)
            br["__EVENTARGUMENT"] = match.group(2)

    sleep(1)
    response = br.submit()
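The key step above is recovering the two arguments the pager button passes to __doPostBack, which become the `__EVENTTARGET` and `__EVENTARGUMENT` hidden fields in the POST. A minimal standalone sketch of that extraction (the anchor markup below is a hypothetical example of what ASP.NET WebForms renders, not taken from the live page):

```python
import re

# Hypothetical pager-button markup as rendered by ASP.NET WebForms:
html = """<a href="javascript:__doPostBack('ctl00$grid','Page$2')">2</a>"""

# Capture the two arguments __doPostBack would receive; posting them back
# as __EVENTTARGET and __EVENTARGUMENT reproduces the JavaScript click.
match = re.search(r"__doPostBack\('(.*?)','(.*?)'\)", html)
event_target = match.group(1)
event_argument = match.group(2)
```

Because the non-greedy groups stop at the first closing quote, this also works when several __doPostBack links appear in the same document, as long as you search each `<a>` element individually.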