How to get rid of a phantom row in an array?

I'm scraping a bunch of tables with HTTParty and then parsing the responses with Nokogiri. Everything works fine, but then I get a phantom row at the top:

require 'nokogiri'
require 'httparty'
require 'byebug'
def scraper
    url = "https://github.com/public-apis/public-apis"
    parsed_page = Nokogiri::HTML(HTTParty.get(url))
    # Get categories from the ul at the top
    categories = parsed_page.xpath('/html/body/div[4]/div/main/div[2]/div/div/div/article/ul/li/a')
    # Get all tables from the page
    tables = parsed_page.xpath('/html/body/div[4]/div/main/div[2]/div/div/div/article/table')
    rows = []
    # Acting on one first for testing before making it dynamic 
    tables[0].search('tr').each do |tr|
        cells = tr.search('td')
        link = ''
        values = []
        row = {
            'name' => '',
            'description' => '',
            'auth' => '',
            'https' => '',
            'cors' => '',
            'category' => '',
            'url' => ''
        }
        cells.css('a').each do |a|
            link += a['href']
        end
        cells.each do |cell|
            values << cell.text
        end
        values << categories[0].text
        values << link
        rows << row.keys.zip(values).to_h
    end
    puts rows
end
scraper

Console output:

{"name"=>"Animals", "description"=>"", "auth"=>nil, "https"=>nil, "cors"=>nil, "category"=>nil, "url"=>nil}
{"name"=>"Cat Facts", "description"=>"Daily cat facts", "auth"=>"No", "https"=>"Yes", 
...

Where is that first row coming from?

The first row you're seeing is most likely the header row. Header rows use <th> instead of <td>, which means cells = tr.search('td') will be an empty collection for the header row.

In most cases the header row is put inside <thead> and the data rows inside <tbody>, so instead of tables[0].search('tr') you can use tables[0].search('tbody tr'), which selects only the rows inside the <tbody> tag.

Your code can be simpler and more resilient. Meditate on this:

require 'nokogiri'
require 'httparty'

URL = 'https://github.com/public-apis/public-apis'
FIELDS = %w[name description auth https cors category url]

doc = Nokogiri::HTML(HTTParty.get(URL))

category = doc.at('article li a').text

rows = doc.at('article table').search('tr')[1..-1].map { |tr| 
  values = tr.search('td').map(&:text)
  link = tr.at('a')['href']
  Hash[
    FIELDS.zip(values + [category, link])
  ]
}

Which results in:

puts rows

# >> {"name"=>"Cat Facts", "description"=>"Daily cat facts", "auth"=>"No", "https"=>"Yes", "cors"=>"No", "category"=>"Animals", "url"=>"https://alexwohlbruck.github.io/cat-facts/"}
# >> {"name"=>"Cats", "description"=>"Pictures of cats from Tumblr", "auth"=>"apiKey", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://docs.thecatapi.com/"}
# >> {"name"=>"Dogs", "description"=>"Based on the Stanford Dogs Dataset", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://dog.ceo/dog-api/"}
# >> {"name"=>"HTTPCat", "description"=>"Cat for every HTTP Status", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://http.cat/"}
# >> {"name"=>"IUCN", "description"=>"IUCN Red List of Threatened Species", "auth"=>"apiKey", "https"=>"No", "cors"=>"Unknown", "category"=>"Animals", "url"=>"http://apiv3.iucnredlist.org/api/v3/docs"}
# >> {"name"=>"Movebank", "description"=>"Movement and Migration data of animals", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://github.com/movebank/movebank-api-doc"}
# >> {"name"=>"Petfinder", "description"=>"Adoption", "auth"=>"OAuth", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://www.petfinder.com/developers/v2/docs/"}
# >> {"name"=>"PlaceGOAT", "description"=>"Placeholder goat images", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://placegoat.com/"}
# >> {"name"=>"RandomCat", "description"=>"Random pictures of cats", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://aws.random.cat/meow"}
# >> {"name"=>"RandomDog", "description"=>"Random pictures of dogs", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://random.dog/woof.json"}
# >> {"name"=>"RandomFox", "description"=>"Random pictures of foxes", "auth"=>"No", "https"=>"Yes", "cors"=>"No", "category"=>"Animals", "url"=>"https://randomfox.ca/floof/"}
# >> {"name"=>"RescueGroups", "description"=>"Adoption", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://userguide.rescuegroups.org/display/APIDG/API+Developers+Guide+Home"}
# >> {"name"=>"Shibe.Online", "description"=>"Random pictures of Shibu Inu, cats or birds", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"http://shibe.online/"}

The problems with your code are:

  • Using search('some selector')[0] is the same as at('some selector'), only the second is cleaner, which reduces visual noise.

    There are other, more subtle differences between what search and at return, which are covered in the documentation. I strongly recommend reading it and experimenting with the examples, because knowing when to use which will save you headaches.

  • Relying on absolute XPath selectors: absolute selectors are extremely brittle. Any change to the HTML has a high likelihood of breaking them. Instead, find useful nodes to hook into, check whether they're unique, and let the parser find them for you.

    The CSS selector 'article li a' skips over all nodes until the "article" node is found, then looks inside it for a child "li" and its following "a". You can do the same thing with XPath, but it's visually noisy, and I'm a big fan of making my code as easy to read and understand as possible.

    Similarly, at('article table') finds the first table under the "article" node, and then search('tr') finds only the rows embedded in that table.

    Because you want to skip the table header, [1..-1] slices the NodeSet and skips the first row.

  • Using map makes it easier to build the structure:

    rows = doc.at('article table').search('tr')[1..-1].map { |tr| 
    

    loops over the rows once, assigning each row's fields to rows.

    values is assigned the text of each "td" node in the NodeSet.

  • You can easily build a hash by using Hash's [] constructor and passing in an array of key/value pairs.

    FIELDS.zip(values + [category, link])
    

    takes the values from the cells and appends a second array containing the category and the link for that row.
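    You can try the construction in plain Ruby, with values borrowed from the output above:

```ruby
# Build a row hash from a field list and a matching list of values.
fields = %w[name description auth https cors category url]
values = ['Cat Facts', 'Daily cat facts', 'No', 'Yes', 'No']
category = 'Animals'
link = 'https://alexwohlbruck.github.io/cat-facts/'

row = Hash[fields.zip(values + [category, link])]
row # => {"name"=>"Cat Facts", "description"=>"Daily cat facts", "auth"=>"No", "https"=>"Yes", "cors"=>"No", "category"=>"Animals", "url"=>"https://alexwohlbruck.github.io/cat-facts/"}

# Since Ruby 2.1 the same thing can be written with zip(...).to_h:
fields.zip(values + [category, link]).to_h == row # => true
```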

My example code is basically the same template every time I scrape a page with tables. There will be subtle differences, but it's a loop over the tables, extracting the cells and converting them into hashes. It's even possible, on a clean table, to automatically grab the hash keys from the cell text of the table's first row.