Web page scraped with Nokogiri returns no data
I'm trying to scrape a list of projects from the UK government's UK Oil Portal, but my code returns no data. What I want instead is an array of the project titles.
class Entry
  def initialize(title)
    @title = title
  end
  attr_reader :title
end

def index
  @projects = Project.all
  require 'open-uri'
  require 'nokogiri'
  doc = Nokogiri::HTML(open("https://itportal.decc.gov.uk/pathfinder/currentprojectsindex.html"))
  entries = doc.css('.operator-container')
  @entries = []
  entries.each do |row|
    title = row.css('.setoutForm').text
    @entries << Entry.new(title)
  end
end
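The link you posted doesn't contain any data. The page you see is a frameset, and each frame is loaded from its own URL. You want to parse the left frame, so you should edit your code to open the left frame's URL: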
# URI.open, because Kernel#open no longer accepts URLs as of Ruby 3.0
doc = Nokogiri::HTML(URI.open('https://itportal.decc.gov.uk/eng/fox/path/PATH_REPORTS/current-projects-index'))
The individual projects are on separate pages, so you need to open each one. For example, the first is:
project_file = URI.open(entries.first.css('a').attribute('href').value)
project_doc = Nokogiri::HTML(project_file)
The "setoutForm" class captures a large amount of text. For example:
> project_doc.css('.setoutForm').text
=> "\n \n Field Type\n Location\n Water Depth (m)\n First Production\n Contact\n \n \n
Oil\n 2/15\n 155m\n Q3/2018\n
\n John Gill\n Business Development Manager\n
jgill@alphapetroleum.com\n 01483 307204\n \n \n
\n \n Project Summary\n \n \n
\n The Cheviot discovery is located in blocks 2/10a, 2/15a and 3/11b. \n
\n Reserves are approximately 46mmbbls oil.\n \n
A Field Development Plan has been submitted and technically approved. The concept is for a leased FPSA with 18+ subsea wells. Oil export will be via tanker offloading.
\n \n \n \n "
But the title isn't in that text. If you want the title, scrape this part of the page:
<div class="field-header" foxid="eu1KcH_d4qniAjiN">Cheviot</div>
Which you can do with this CSS selector:
> project_doc.css('.operator-container .field-header').text
=> "Cheviot"
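Write this code step by step. It's hard to find out where your code goes wrong unless you single-step it. For example, I used Nokogiri's command line tool to open an interactive Ruby shell, with: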
nokogiri https://itportal.decc.gov.uk/eng/fox/path/PATH_REPORTS/current-projects-index