How to get rid of a phantom row in an array?

I'm scraping a bunch of tables with HTTParty and then parsing the responses with Nokogiri. Everything works fine, but then I get a phantom row at the top:

require 'nokogiri'
require 'httparty'
require 'byebug'
def scraper
    url = "https://github.com/public-apis/public-apis"
    parsed_page = Nokogiri::HTML(HTTParty.get(url))
    # Get categories from the ul at the top
    categories = parsed_page.xpath('/html/body/div[4]/div/main/div[2]/div/div/div/article/ul/li/a')
    # Get all tables from the page
    tables = parsed_page.xpath('/html/body/div[4]/div/main/div[2]/div/div/div/article/table')
    rows = []
    # Acting on one first for testing before making it dynamic 
    tables[0].search('tr').each do |tr|
        cells = tr.search('td')
        link = ''
        values = []
        row = {
            'name' => '',
            'description' => '',
            'auth' => '',
            'https' => '',
            'cors' => '',
            'category' => '',
            'url' => ''
        }
        cells.css('a').each do |a|
            link += a['href']
        end
        cells.each do |cell|
            values << cell.text
        end
        values << categories[0].text
        values << link
        rows << row.keys.zip(values).to_h
    end
    puts rows
end
scraper

Console output:

{"name"=>"Animals", "description"=>"", "auth"=>nil, "https"=>nil, "cors"=>nil, "category"=>nil, "url"=>nil}
{"name"=>"Cat Facts", "description"=>"Daily cat facts", "auth"=>"No", "https"=>"Yes", 
...

Where is that first row coming from?

The first row you're seeing is most likely the header row. Header rows use <th> instead of <td>, which means cells = tr.search('td') will be an empty collection for the header row.

In most cases the header row is put inside <thead> and the data rows inside <tbody>, so instead of tables[0].search('tr') you can use tables[0].search('tbody tr'), which selects only the rows inside the <tbody> tag.

Your code can be simpler and more resilient. Meditate on this:

require 'nokogiri'
require 'httparty'

URL = 'https://github.com/public-apis/public-apis'
FIELDS = %w[name description auth https cors category url]

doc = Nokogiri::HTML(HTTParty.get(URL))

category = doc.at('article li a').text

rows = doc.at('article table').search('tr')[1..-1].map { |tr| 
  values = tr.search('td').map(&:text)
  link = tr.at('a')['href']
  Hash[
    FIELDS.zip(values + [category, link])
  ]
}

Which results in:

puts rows

# >> {"name"=>"Cat Facts", "description"=>"Daily cat facts", "auth"=>"No", "https"=>"Yes", "cors"=>"No", "category"=>"Animals", "url"=>"https://alexwohlbruck.github.io/cat-facts/"}
# >> {"name"=>"Cats", "description"=>"Pictures of cats from Tumblr", "auth"=>"apiKey", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://docs.thecatapi.com/"}
# >> {"name"=>"Dogs", "description"=>"Based on the Stanford Dogs Dataset", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://dog.ceo/dog-api/"}
# >> {"name"=>"HTTPCat", "description"=>"Cat for every HTTP Status", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://http.cat/"}
# >> {"name"=>"IUCN", "description"=>"IUCN Red List of Threatened Species", "auth"=>"apiKey", "https"=>"No", "cors"=>"Unknown", "category"=>"Animals", "url"=>"http://apiv3.iucnredlist.org/api/v3/docs"}
# >> {"name"=>"Movebank", "description"=>"Movement and Migration data of animals", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://github.com/movebank/movebank-api-doc"}
# >> {"name"=>"Petfinder", "description"=>"Adoption", "auth"=>"OAuth", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://www.petfinder.com/developers/v2/docs/"}
# >> {"name"=>"PlaceGOAT", "description"=>"Placeholder goat images", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://placegoat.com/"}
# >> {"name"=>"RandomCat", "description"=>"Random pictures of cats", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://aws.random.cat/meow"}
# >> {"name"=>"RandomDog", "description"=>"Random pictures of dogs", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"https://random.dog/woof.json"}
# >> {"name"=>"RandomFox", "description"=>"Random pictures of foxes", "auth"=>"No", "https"=>"Yes", "cors"=>"No", "category"=>"Animals", "url"=>"https://randomfox.ca/floof/"}
# >> {"name"=>"RescueGroups", "description"=>"Adoption", "auth"=>"No", "https"=>"Yes", "cors"=>"Unknown", "category"=>"Animals", "url"=>"https://userguide.rescuegroups.org/display/APIDG/API+Developers+Guide+Home"}
# >> {"name"=>"Shibe.Online", "description"=>"Random pictures of Shibu Inu, cats or birds", "auth"=>"No", "https"=>"Yes", "cors"=>"Yes", "category"=>"Animals", "url"=>"http://shibe.online/"}

The problems with your code are:

  • Using search('some selector')[0] is the same as at('some selector'), only the second is cleaner, which reduces visual noise.

    There are other, more subtle differences between what search and at return, which are covered in the documentation. I strongly recommend reading it and experimenting with the examples, because knowing when to use which will save you headaches.

  • Relying on absolute XPath selectors: absolute selectors are extremely brittle. Any change to the HTML has a high likelihood of breaking them. Instead, find useful nodes to hook into, check whether they're unique, and let the parser find them for you.

    The CSS selector 'article li a' skips over all nodes until the "article" node is found, then looks inside it for a child "li" and its following "a". You can do the same thing with XPath, but it's visually noisy, and I'm a big fan of making my code as easy to read and understand as possible.

    Similarly, at('article table') finds the first table under the "article" node, and then search('tr') finds only the rows embedded in that table.

    Because you want to skip the table header, [1..-1] slices the NodeSet and skips the first row.

  • Using map makes it easier to build the structure:

    rows = doc.at('article table').search('tr')[1..-1].map { |tr| 
    

    loops over the rows once, assigning each row's fields to rows.

    values is assigned the text of each "td" node in the NodeSet.

  • You can easily build a hash by using Hash's [] constructor and passing in an array of key/value pairs.

    FIELDS.zip(values + [category, link])
    

    takes the values from the cells and appends a second array containing the category and the link for that row.
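    You can try the construction in plain Ruby, with values borrowed from the output above:

```ruby
# Build a row hash from a field list and a matching list of values.
fields = %w[name description auth https cors category url]
values = ['Cat Facts', 'Daily cat facts', 'No', 'Yes', 'No']
category = 'Animals'
link = 'https://alexwohlbruck.github.io/cat-facts/'

row = Hash[fields.zip(values + [category, link])]
row # => {"name"=>"Cat Facts", "description"=>"Daily cat facts", "auth"=>"No", "https"=>"Yes", "cors"=>"No", "category"=>"Animals", "url"=>"https://alexwohlbruck.github.io/cat-facts/"}

# Since Ruby 2.1 the same thing can be written with zip(...).to_h:
fields.zip(values + [category, link]).to_h == row # => true
```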

My example code is basically the same template every time I scrape a page with tables. There will be subtle differences, but it's a loop over the tables, extracting the cells and converting them into hashes. It's even possible, on a clean table, to automatically grab the hash keys from the cell text of the table's first row.