Scraping information from all of the search results pages
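I am trying to scrape data from the UCAS website to display all of the university names from every page returned by a basic search.
So far, without the loop working, it displays the names of all the universities on the first page, plus some random information, as shown below: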
"The University of Aberdeen
Abertay University
Aberystwyth University
ABI College
Abingdon and Witney College
The Academy of Contemporary Music
Access to Music
Accrington & Rossendale College
Activate Learning (Oxford, Reading, Banbury & Bicester)
The College of Agriculture, Food and Rural Enterprise
Amersham & Wycombe College
Amsterdam Fashion Academy
Anglia Ruskin University
Anglo European College of Chiropractic
Arden University (RDI)
University of the Arts London
Arts University Bournemouth (formerly University College)
ARU London
Askham Bryan College
Aston University, Birmingham
Availability
Applying through Extra
Single/Combined subjects
Provider types
How you study
Qualification level
Conservatoire specialism"
Here is my code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'mechanize'
mechanize = Mechanize.new
doc = mechanize.get('http://search.ucas.com/')
form = doc.forms.first
form['Vac'] = '2'
form['AvailableIn'] = '2016'
doc = form.submit
doc.search('li.results clearfix').each do |h3|
  puts h3.text.strip
  while a = doc.at('div.pagerclearfix a')
    doc = Nokogiri::HTML(open(a[:href]))
    doc.search('results clearfix').each do |h3|
      puts h3.text.strip
    end
  end
end
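You don't need require 'rubygems', since that is an anti-pattern. You don't need require 'nokogiri' either, because Mechanize already requires it, and you don't need OpenURI at all.
The pagination doesn't work because the div.pagerclearfix selector doesn't match anything: pager and clearfix are separate classes. Also, the while loop is in the wrong place; it shouldn't be inside the each loop that prints the results.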
What you should end up with is something like this:
require 'mechanize'

mechanize = Mechanize.new
page = mechanize.get('http://search.ucas.com/')

# Fill in and submit the search form
form = page.forms.first
form['Vac'] = '2'
form['AvailableIn'] = '2016'
page = form.submit

# Print the names on the first results page
page.search('li.result h3').each do |h3|
  puts h3.text.strip
end

# Follow the ">" (next page) link until there isn't one
while next_page_link = page.at('.pager a[text()=">"]')
  page = mechanize.get(next_page_link['href'])
  page.search('li.result h3').each do |h3|
    puts h3.text.strip
  end
end
There are many ways to implement pagination; searching for the "next page" link is usually the most straightforward.
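The shape of that "follow the next link" loop can be sketched without any network access. Everything below is hypothetical: FAKE_PAGES stands in for the site, the provider names are placeholders, and collect_names is a helper invented for this example. The idea is that the code handling one page lives in a single place and the loop simply runs until no next link is left.

```ruby
# Hypothetical stand-in for a paginated site: each "page" lists its result
# names and the URL of the next page (nil on the last page).
FAKE_PAGES = {
  '/results?page=1' => { names: ['Provider A', 'Provider B'], next_url: '/results?page=2' },
  '/results?page=2' => { names: ['Provider C'], next_url: '/results?page=3' },
  '/results?page=3' => { names: ['Provider D'], next_url: nil }
}.freeze

# Collects the names from every page, starting at start_url.
# `fetch` is any callable that maps a URL to a page hash.
def collect_names(fetch, start_url)
  names = []
  url = start_url
  while url
    page = fetch.call(url)
    names.concat(page[:names]) # handle the current page
    url = page[:next_url]      # nil on the last page ends the loop
  end
  names
end

fetch = ->(url) { FAKE_PAGES.fetch(url) }
puts collect_names(fetch, '/results?page=1')
```

With Mechanize, fetch would wrap mechanize.get, and the next URL would come from the pager link rather than from a hash key.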