How can I scrape this?

Question

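I need to scrape this page (which has a form): http://kllads.kar.nic.in/MLAWise_reports.aspx, preferably with Python (if not Python, then JavaScript). I was looking at libraries like RoboBrowser (which is basically Mechanize + BeautifulSoup) and (maybe) Selenium, but I'm not sure how to go about it. From inspecting the elements, it appears to be a WebForm that I need to fill in. Once it is filled in, the page generates some data that I need to store. How should I do this?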

Answer 1

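You can interact with JavaScript web forms fairly easily in Selenium. You may need to install a web driver first, but beyond that, all you need to do is locate each dropdown by its XPath and then have Selenium select an option from it using the option's XPath. For the page provided, it would look something like this: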

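# import the Selenium pieces used below
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# open a Chrome browser using the webdriver
# (Selenium 4 takes the driver path through a Service object;
# the executable_path keyword on webdriver.Chrome is no longer accepted)
path_to_chromedriver = '/Users/Michael/Downloads/chromedriver'
browser = webdriver.Chrome(service=Service(executable_path=path_to_chromedriver))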

# open web page using browser
browser.get('http://kllads.kar.nic.in/MLAWise_reports.aspx')

# wait for the page to load, then find the 'Constituency Name' dropdown and select 'Aland (46)'
# (find_element(By.XPATH, ...) replaces find_element_by_xpath, which Selenium 4 removed)
WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="ddlconstname"]')))
browser.find_element(By.XPATH, '//*[@id="ddlconstname"]/option[2]').click()

# wait for the page to load, then find the 'Select Status' dropdown and select 'OnGoing'
WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="ddlstatus1"]')))
browser.find_element(By.XPATH, '//*[@id="ddlstatus1"]/option[2]').click()

# wait for the page to load, then click 'Generate Report'
# (until() returns the element it waited for, so it can be clicked directly)
gen_report = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="BtnReport"]')))
gen_report.click()

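Between interactions, you are simply giving the browser some time to load before trying to click the next element. Once all the form fields are filled in, the page displays data based on the options selected, and you should be able to scrape the table data. I ran into some problems trying to load the data for the first constituency name option, but the other options seemed to work fine.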
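For the "scrape the table data" step, here is a minimal sketch that uses only the Python standard library. I have not checked how the report page is actually laid out, so treat it as an assumption that the data sits in ordinary `<tr>`/`<td>` markup with no nested tables. The names `TableRowParser`, `table_rows`, and `save_rows` are made up for illustration.

```python
import csv
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        # Store the current cell with runs of whitespace collapsed.
        if self._cell is not None and self._row is not None:
            self._row.append(' '.join(''.join(self._cell).split()))
        self._cell = None

    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            self._close_cell()
        elif tag == 'tr':
            self._close_cell()
            if self._row:
                self.rows.append(self._row)
            self._row = None


def table_rows(html):
    """Return every table row in an HTML string as a list of cell texts."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def save_rows(rows, path):
    """Write the rows to a CSV file so the data can be stored."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows(rows)
```

After the report has been generated, something like `save_rows(table_rows(browser.page_source), 'report.csv')` would store it. If the page turns out to wrap the report in layout tables, a parser that handles nesting (BeautifulSoup, for example) would be a better fit.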

You should also be able to loop through all of the dropdown options available under each web form to display all of the data.
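A rough sketch of that loop is below. It rests on several assumptions I have not verified against the site: that position 1 in each dropdown is a placeholder (the code above picks `option[2]` for the first real entry), that both dropdowns are still on the page after a report is generated, and that a fixed pause is enough for the page to reload. The helpers `pick`, `count_options`, and `iter_reports` are made-up names. The string `'xpath'` is the value of Selenium's `By.XPATH`, written out here so the snippet has no import dependency.

```python
import time

XPATH = 'xpath'  # same string as selenium.webdriver.common.by.By.XPATH


def pick(browser, select_id, position, pause):
    """Click the option at a 1-based position and return its label."""
    option = browser.find_element(
        XPATH, f'//*[@id="{select_id}"]/option[{position}]')
    label = option.text
    option.click()
    time.sleep(pause)  # crude stand-in for an explicit WebDriverWait
    return label


def count_options(browser, select_id):
    """Return how many <option> entries the dropdown currently has."""
    return len(browser.find_elements(XPATH, f'//*[@id="{select_id}"]/option'))


def iter_reports(browser, first=2, pause=2.0):
    """Yield (constituency, status, page HTML) for each dropdown combination."""
    for i in range(first, count_options(browser, 'ddlconstname') + 1):
        # Select the constituency before counting statuses, in case the
        # status dropdown is only filled in after this choice.
        pick(browser, 'ddlconstname', i, pause)
        for j in range(first, count_options(browser, 'ddlstatus1') + 1):
            # Look the elements up again on every pass: the page may reload
            # after a click, which makes earlier element references stale.
            constituency = pick(browser, 'ddlconstname', i, pause)
            status = pick(browser, 'ddlstatus1', j, pause)
            browser.find_element(XPATH, '//*[@id="BtnReport"]').click()
            time.sleep(pause)
            yield constituency, status, browser.page_source
```

With the `browser` object from the answer's code, `for constituency, status, html in iter_reports(browser): ...` would visit each combination in turn. In practice you would probably want to swap the fixed pauses for the explicit waits shown earlier.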

Hope that helps!

python

selenium

mechanize

web-scraping

robobrowser