Nokogiri parsing: issue with missing elements

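I have a plain HTML document with no CSS, and I need to export some of its content to an Excel sheet. I tried using Nokogiri, but it works on the basis of CSS selectors.

Has anyone tried something like this?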

<html>
 <head></head>
  <body>
    ***NOTE***
   <br>
      Items 
   <br>
   <br>
      Invoice Number : [78945824] PO Number : [4587958]
   <br>
       Track It : <a href="abc.com"> 12345</a>
   <br>
   <br>
      Items 
   <br>
   <br>
      Invoice Number : [79546828] PO Number : [4567892]
   <br>

   <br>
   <br>
      Items 
   <br>
   <br>
      Invoice Number : [78976824] PO Number : [897569]
   <br>
      Track It : <a href="abc.com"> 12345</a>
   <br>
   </body>
   </html>

I am able to retrieve the PO numbers and the tracking numbers:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

PAGE_URL = "a.html"

# Parse the document and take the plain text of <body>.
page = Nokogiri::HTML(open(PAGE_URL))
data = page.css("body").text

# PO numbers are the bracketed digits after "PO Number".
po_numbers = data.scan(/Invoice Number : \[\d+\] PO Number : \[(\d+)\]/).flatten
# Tracking numbers are the text of every <a> element.
tracking_numbers = page.css("a").text.split

rows = [["PO Number", "Tracking Number"]].concat(po_numbers.zip(tracking_numbers))
puts po_numbers
puts tracking_numbers


=> po_numbers = ["4587958", "4567892", "4587958"]
=> tracking_numbers = ["12543", "12356"]

When we zip them together, we get:

=> po_numbers.zip(tracking_numbers)
=> [["4587958", "12543"], ["4567892", "12356"], ["4587958", nil]]

What we want is:

=> [["4587958", "12543"], ["4567892", nil], ["4587958", "12356"]]

If you can scan all the PO numbers (po_numbers) with a regex, you can do the same for the tracking numbers (tracking_numbers):

tracking_numbers = data.scan(/Tracking no : (\d*)/).flatten

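The returned array contains an empty entry wherever a tracking number is missing (an empty string rather than nil, because \d* can match zero digits), so you can iterate over the PO numbers and look up the tracking number at the same index: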

po_numbers.each_with_index do |elm, index| 
  p "PO Number: #{elm}, Tracking Number: #{tracking_numbers[index]}"
end
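
A minimal sketch of the approach above, assuming every record in the body text carries a "Tracking no :" line even when the number itself is absent. The sample text is made up for illustration; in the real script it would come from page.css("body").text. Empty captures are converted to nil so the zipped rows have the shape the question asks for.

```ruby
# Hypothetical body text standing in for page.css("body").text.
data = [
  "Invoice Number : [78945824] PO Number : [4587958]",
  "Tracking no : 12543",
  "Invoice Number : [79546828] PO Number : [4567892]",
  "Tracking no : ",
  "Invoice Number : [78976824] PO Number : [897569]",
  "Tracking no : 12356"
].join("\n")

po_numbers = data.scan(/PO Number : \[(\d+)\]/).flatten

# \d* matches zero digits too, so a missing number yields "" and keeps its slot.
tracking_numbers = data.scan(/Tracking no : (\d*)/).flatten
                       .map { |n| n.empty? ? nil : n }

rows = po_numbers.zip(tracking_numbers)
rows.each { |po, tracking| puts "PO Number: #{po}, Tracking Number: #{tracking.inspect}" }
```

Because both arrays now have one entry per record, zip pairs them positionally without shifting later tracking numbers up.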

Update

This regex matches the updated HTML:

/Track It :\s*(?:<a href=".*">\s*(\d+)\s*<\/a>|$)/

It matches both an empty tracking number and a link.
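
As a rough sketch of how that regex might be used: it has to run against the raw HTML rather than the extracted text, since the text no longer contains the <a> tags. The html string below is a made-up fragment; in the real script it could be the file contents. It also assumes a "Track It :" label is present for every record, even when no link follows.

```ruby
# Hypothetical raw HTML fragment; one record has a link, the other has none.
html = [
  "Invoice Number : [78945824] PO Number : [4587958]",
  "<br>",
  'Track It : <a href="abc.com"> 12345</a>',
  "<br>",
  "Invoice Number : [79546828] PO Number : [4567892]",
  "<br>",
  "Track It : ",
  "<br>"
].join("\n")

TRACK_RE = /Track It :\s*(?:<a href=".*">\s*(\d+)\s*<\/a>|$)/

# The capture group is nil when the "end of line" branch of the alternation matches.
tracking_numbers = html.scan(TRACK_RE).flatten
p tracking_numbers
```

The nil entries keep each tracking number aligned with its PO number when the two arrays are zipped.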

Try this:

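data = page.css("body").text
# Strip all spaces so the labels become "PONumber:" and "TrackIt:", then split into lines.
lines = data.gsub(" ", "").split(/\n/)
po = []
track = []
lines.each do |line|
  # PO number: first run of digits after the "PONumber:" label.
  if line.include?("PONumber")
    po << line.split("PONumber:").last.scan(/\d+/)[0]
  end
  # Tracking number: whatever follows "TrackIt:" (nil when nothing follows).
  if line.include?("TrackIt")
    track << line.split("TrackIt:").last
  end
end
po.zip(track)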
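
If the "Track It" line can be missing entirely, as in the second record of the HTML shown in the question, collecting the two lists separately will still misalign them. One possible alternative, sketched here under the assumption that every record starts with a line containing only "Items", is to split the body text into one block per record and pull both values out of each block. The sample text stands in for page.css("body").text.

```ruby
# Sample body text mirroring the question's HTML (assumed layout).
data = [
  "***NOTE***",
  "Items",
  "Invoice Number : [78945824] PO Number : [4587958]",
  "Track It :  12345",
  "Items",
  "Invoice Number : [79546828] PO Number : [4567892]",
  "Items",
  "Invoice Number : [78976824] PO Number : [897569]",
  "Track It :  12345"
].join("\n")

# Split on the "Items" marker lines; drop whatever precedes the first record.
blocks = data.split(/^[ \t]*Items[ \t]*$/).drop(1)

rows = blocks.map do |block|
  po       = block[/PO Number : \[(\d+)\]/, 1]
  tracking = block[/Track It :[ \t]*(\d+)/, 1] # nil when absent or empty
  [po, tracking]
end

rows.each { |row| p row }
```

Each row is built from a single record, so a missing tracking number shows up as nil in its own row instead of shifting the numbers that follow.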