Web page scraped with Nokogiri returns no data

I am trying to scrape a list of projects from the UK government's UK Oil Portal, but my code returns no data. What I want is an array of the project titles.

class Entry
  def initialize(title)
    @title = title
  end
  attr_reader :title
end

def index
  @projects=Project.all
  require 'open-uri'
  require 'nokogiri'
  doc = Nokogiri::HTML(URI.open("https://itportal.decc.gov.uk/pathfinder/currentprojectsindex.html"))

  entries = doc.css('.operator-container')
  @entries = []
  entries.each do |row|
    title = row.css('.setoutForm').text
    @entries << Entry.new(title)
  end
end

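The link you posted doesn't contain any data. The page you see is a frameset, and each frame is loaded from its own URL. You want to parse the left frame, so you should edit your code to open the left frame's URL: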

  doc = Nokogiri::HTML(URI.open('https://itportal.decc.gov.uk/eng/fox/path/PATH_REPORTS/current-projects-index'))

The individual projects are on separate pages, and you need to open each one. For example, the first one is:

project_file = URI.open(entries.first.css('a').attribute('href').value)
project_doc = Nokogiri::HTML(project_file)

The "setoutForm" class grabs a large amount of text. For example:

> project_doc.css('.setoutForm').text
=> "\n            \n              Field Type\n              Location\n              Water De
pth (m)\n              First Production\n              Contact\n            \n            \n
              Oil\n              2/15\n              155m\n              Q3/2018\n          
    \n                John Gill\n                Business Development Manager\n             
   jgill@alphapetroleum.com\n                01483 307204\n              \n            \n   
       \n            \n              Project Summary\n            \n            \n          
    \n                The Cheviot discovery is located in blocks 2/10a, 2/15a and 3/11b. \n 
               \n                Reserves are approximately 46mmbbls oil.\n                \
n                A Field Development Plan has been submitted and technically approved. The c
oncept is for a leased FPSA with 18+ subsea wells. Oil export will be via tanker offloading.
\n                \n              \n            \n          "   
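
If you do want to work with that text, most of it is indentation and blank lines. A small sketch of one way to tidy it, where the `tidy_lines` helper is my own name and the sample string is a shortened piece of the output above:

```ruby
# Split scraped text into lines, trim the indentation and drop blank lines.
def tidy_lines(raw_text)
  raw_text.lines.map(&:strip).reject(&:empty?)
end

sample = "\n            \n              Field Type\n              Location\n" \
         "            \n              Oil\n              2/15\n"

p tidy_lines(sample)
```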

But the title isn't in that text. If you want the title, scrape this part of the page:

<div class="field-header" foxid="eu1KcH_d4qniAjiN">Cheviot</div>

You can do that with this CSS selector:

> project_doc.css('.operator-container .field-header').text
=> "Cheviot"

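Write this code step by step. It's hard to find out where your code goes wrong unless you single-step it. For example, I used Nokogiri's command line tool to open an interactive Ruby shell, with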

nokogiri https://itportal.decc.gov.uk/eng/fox/path/PATH_REPORTS/current-projects-index