Writing a loop over multiple pages with BeautifulSoup
I'm trying to scrape several pages of results from the county search tool here: http://www2.tceq.texas.gov/oce/waci/index.cfm?fuseaction=home.main
But I can't seem to figure out how to iterate beyond the first page.
import csv
from mechanize import Browser
from bs4 import BeautifulSoup

url = 'http://www2.tceq.texas.gov/oce/waci/index.cfm?fuseaction=home.main'

br = Browser()
br.set_handle_robots(False)
br.open(url)

br.select_form("county_search_form")
br.form['county_select'] = ['111111111111180']
br.form['start_date_month'] = ['1']
br.form['start_date_day'] = ['1']
br.form['start_date_year'] = ['2014']
br.submit()

soup = BeautifulSoup(br.response())
complaints = soup.find('table', class_='waciList')
output = []

import requests
for i in xrange(1, 8):
    page = requests.get("http://www2.tceq.texas.gov/oce/waci/index.cfm?fuseaction=home.search&pageNumber={}".format(i))
    if not page.ok:
        continue
    soup = BeautifulSoup(requests.text)

for tr in complaints.findAll('tr'):
    print tr
    output_row = []
    for td in tr.findAll('td'):
        output_row.append(td.text.strip())
    output.append(output_row)

br.open(url)
print 'page 2'
complaints = soup.find('table', class_='waciList')
for tr in complaints.findAll('tr'):
    print tr

with open('out-tceq.csv', 'w') as csvfile:
    my_writer = csv.writer(csvfile, delimiter='|')
    my_writer.writerows(output)
All I'm getting in the output CSV is the first page of results. After looking at other scraping examples that use bs4, I tried adding the requests loop, but I get the error 'ImportError: No module named requests.'
Any ideas on how I should loop through all eight pages of results to get them into the .csv?
You don't actually need the requests module to iterate through the paginated search results; mechanize alone is enough. Here is one possible approach using mechanize.
First, get all of the pagination links on the current page:
links = br.links(url_regex=r"fuseaction=home.search&pageNumber=")
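Note that links() is evaluated against the page the browser currently has open (here, the first results page returned by br.submit()) and yields mechanize Link objects whose URLs match the regex. Since following a link replaces the current response, it can be safer to materialize the generator into a list first. As a quick sanity check, you can print the matched URLs before following them:

links = list(br.links(url_regex=r"fuseaction=home.search&pageNumber="))
for link in links:
    print(link.absolute_url)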
Then iterate over the pagination links, opening each link and collecting the useful information from each page on every iteration:
for link in links:
    # open the link's url:
    br.follow_link(link)
    # print the url of the current page, just to make sure we are on the expected page:
    print(br.geturl())
    # create soup from the HTML of the page we just opened:
    soup = BeautifulSoup(br.response())
    # TODO: gather information from the current soup object here
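Putting it all together, a rough, untested sketch might look like the following. It reuses the form fields and the waciList table lookup from your code; note that if the pagination links appear more than once on the page (e.g., at both the top and the bottom of the results table), you may want to deduplicate them by URL first:

import csv
from mechanize import Browser
from bs4 import BeautifulSoup

url = 'http://www2.tceq.texas.gov/oce/waci/index.cfm?fuseaction=home.main'

br = Browser()
br.set_handle_robots(False)
br.open(url)

br.select_form("county_search_form")
br.form['county_select'] = ['111111111111180']
br.form['start_date_month'] = ['1']
br.form['start_date_day'] = ['1']
br.form['start_date_year'] = ['2014']
br.submit()

output = []

def scrape_rows(soup):
    # collect one output row per <tr> of the complaints table
    complaints = soup.find('table', class_='waciList')
    if complaints is None:
        return
    for tr in complaints.findAll('tr'):
        output.append([td.text.strip() for td in tr.findAll('td')])

# page 1 is the response to the form submission itself
scrape_rows(BeautifulSoup(br.response()))

# materialize the link list before navigating away from page 1
links = list(br.links(url_regex=r"fuseaction=home.search&pageNumber="))
for link in links:
    br.follow_link(link)
    scrape_rows(BeautifulSoup(br.response()))

with open('out-tceq.csv', 'w') as csvfile:
    csv.writer(csvfile, delimiter='|').writerows(output)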