Scraping multiple table row siblings with Nokogiri

I am trying to parse a table with the following markup.

<table>
  <tr class="athlete">
    <td colspan="2" class="name">Alex</td>
  </tr>
  <tr class="run">
    <td>5.00</td>
    <td>10.00</td>
  </tr>
  <tr class="run">
    <td>5.20</td>
    <td>10.50</td>
  </tr>
  <tr class="end"></tr>
  <tr class="athlete">
    <td colspan="2" class="name">John</td>
  </tr>
  <tr class="run">
    <td>5.00</td>
    <td>10.00</td>
  </tr>
  <tr class="end"></tr>
</table>

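I need to loop through each .athlete table row and collect every sibling .run table row below it until I reach the .end row, then repeat for the next athlete, and so on. Some .athlete rows have two .run rows, others have one.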

Here is what I have so far. I loop through the athletes:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "http://myurl.com"
doc = Nokogiri::HTML(URI.open(url)) # Kernel#open no longer opens URLs on Ruby 3.0+

doc.css(".athlete").each do |athlete|
  puts athlete.at_css(".name").text
  # Loop through the sibling .run rows until I reach the .end row
  # output the value of the td's in the .run row
end

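I can't figure out how to get each sibling .run row and stop at the .end row. I feel it would be easier if the table were better structured, but unfortunately I have no control over the markup. Any help would be much appreciated!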

I would process the table as follows:

  1. Find the table you want to process

    table = doc.at_css("table")
    
  2. Get all the immediate rows in the table

    rows = table.css("> tr")
    
  3. Group the rows, using .athlete and .end as the boundaries

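    grouped = [[]]
    rows.each do |row|
      if row['class'] == 'athlete' && grouped.last.empty?
        # an .athlete row opens a new group
        grouped.last << row
      elsif row['class'] == 'end' && !grouped.last.empty?
        # an .end row closes the current group and starts an empty one
        grouped.last << row
        grouped << []
      elsif !grouped.last.empty?
        # any other row inside an open group is collected
        grouped.last << row
      end
    end
    # discard a trailing group that is empty or never reached an .end row
    grouped.pop if grouped.last.empty? || grouped.last.last['class'] != 'end'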
    
  4. Process the grouped rows

    grouped.each do |group|
      puts "BEGIN: >> #{group.first.text.strip} <<"
      group[1..-2].each do |row|
        # collapse whitespace only; a bare squeeze would also turn "5.00" into "5.0"
        puts "  #{row.text.split.join(' ')}"
      end
      puts "END: >> #{group.last.text.strip} <<"
    end
    

require 'nokogiri'

doc = <<DOC
<table>
  <tr class="athlete">
    <td colspan="2" class="name">Alex</td>
  </tr>
  <tr class="run">
    <td>5.00</td>
    <td>10.00</td>
  </tr>
  <tr class="run">
    <td>5.20</td>
    <td>10.50</td>
  </tr>
  <tr class="end"></tr>
  <tr class="athlete">
    <td colspan="2" class="name">John</td>
  </tr>
  <tr class="run">
    <td>5.00</td>
    <td>10.00</td>
  </tr>
  <tr class="end"></tr>
</table>
DOC

doc = Nokogiri::HTML(doc)
# You can leave .end out of the selector if those rows are always empty and not needed
trs = doc.css('.athlete, .run, .end').to_a
# Produces [[athlete, run, ..., end], [athlete, run, ..., end], ...] as arrays of <tr> nodes
athletes = trs.slice_before { |elm| elm.attr('class') == 'athlete' }.to_a

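athletes.map! do |athlete|
  {
    # the first row of each slice is the .athlete row
    name: athlete.shift.at_css('.name').text,
    # to_f parses only the leading number, so each run keeps its first <td> value;
    # use run.css('td').map { |td| td.text.to_f } to collect every cell instead
    runs: athlete
      .select { |tr| tr.attr('class') == 'run' }
      .map { |run| run.text.to_f }
  }
end

puts athletes.inspect
# [{:name=>"Alex", :runs=>[5.0, 5.2]}, {:name=>"John", :runs=>[5.0]}]
# (Ruby 3.4+ prints the same data with keys formatted as name: and runs:)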
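
To see how the `slice_before` grouping above behaves independently of Nokogiri, here is a small sketch on plain strings. The `classes` and `stray` arrays are made-up stand-ins for the class attribute of each row in document order, not data taken from a real page.

```ruby
# Hypothetical stand-in for the class of each <tr>, in document order.
classes = %w[athlete run run end athlete run end]

# slice_before starts a new chunk in front of every element
# for which the block returns true.
groups = classes.slice_before { |c| c == 'athlete' }.to_a
p groups

# Rows that appear before the first 'athlete' end up in a leading chunk
# of their own, so real input may need that chunk filtered out.
stray = %w[run athlete run end]
stray_groups = stray.slice_before { |c| c == 'athlete' }.to_a
p stray_groups
```

Because the first element always opens a chunk, it is worth checking that each chunk really begins with an .athlete row before calling `at_css('.name')` on it.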