Scraping for a specific title using Nokogiri in Ruby

Question

I am currently practicing web scraping on the NYT Best Sellers website. I want to get the title of the #1 book on the list, and found this HTML element:

<div class="book-body">
  <p class="freshness">12 weeks on the list</p>
  <h3 class="title" itemprop="name">CRAZY RICH ASIANS</h3>
  <p class="author" itemprop="author">by Kevin Kwan</p>
  <p itemprop="description" class="description">A New Yorker gets a surprise when she spends the summer with her boyfriend in Singapore.</p>
</div>

I am using the following code to get the text:

doc.css(".title").text

However, it returns the title of every book on the list. How would I get just the one title, "CRAZY RICH ASIANS"?

Answer 1

If you look at what doc.css(".title") returns, you will see that it is a collection of all the titles, as Nokogiri::XML::Element objects.

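As far as I know, CSS has no selector for targeting the first element with a given class (if I am wrong, someone will surely correct me). Getting the first element from a Nokogiri::XML::NodeSet is still very simple, though, because in many ways it behaves like an Array. For example: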

doc.css(".title")[0].text

You can also use XPath to select only the first one (since XPath does support index-based selection), like so:

doc.xpath("(//h3[@class='title'])[1]").text

Please note:

Ruby indexing starts at 0, as in the first example;
XPath indexing starts at 1, as in the second example.
