Nokogiri parsing: handling a missing element
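I have a plain HTML document with no CSS. I need to export some of its contents to an Excel sheet. I tried using Nokogiri, but it works on the basis of CSS selectors.
Has anyone tried something like this?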
<html>
<head></head>
<body>
***NOTE***
<br>
Items
<br>
<br>
Invoice Number : [78945824] PO Number : [4587958]
<br>
Track It : <a href="abc.com"> 12345</a>
<br>
<br>
Items
<br>
<br>
Invoice Number : [79546828] PO Number : [4567892]
<br>
<br>
<br>
Items
<br>
<br>
Invoice Number : [78976824] PO Number : [897569]
<br>
Track It : <a href="abc.com"> 12345</a>
<br>
</body>
</html>
I am able to retrieve the PO numbers and the tracking numbers:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
PAGE_URL = "a.html"
page = Nokogiri::HTML(open(PAGE_URL))
data = page.css("body").text
po_numbers = data.scan(/Invoice Number : \[\d+\] PO Number : \[(\d+)\]/).flatten
tracking_numbers = page.css("a").text.split
[["PO Number", "Tracking Number"]].concat(po_numbers.zip(tracking_numbers))
puts po_numbers
puts tracking_numbers
=> po_numbers = ["4587958", "4567892", "4587958"]
=> tracking_numbers = ["12543", "12356"]
When we zip them together, we get:
=> po_numbers.zip(tracking_numbers)
=> [["4587958", "12543"], ["4567892", "12356"], ["4587958", nil]]
What we want is:
=> [["4587958", "12543"], ["4567892", nil], ["4587958", "12356"]]
If you can scan for all the invoice numbers (po_numbers) with a regular expression, you can do the same for the tracking numbers (tracking_numbers):
tracking_numbers = data.scan(/Tracking no : (\d*)/).flatten
The returned array contains nil, so you can iterate over the PO numbers and look up the tracking number at the same index:
po_numbers.each_with_index do |elm, index|
  p "PO Number: #{elm}, Tracking Number: #{tracking_numbers[index]}"
end
Update
This regular expression matches the updated HTML:
/Track It :\s*(?:<a href=".*">\s*(\d+)\s*<\/a>|$)/
It matches both an empty tracking number and a link.
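One way to keep each PO number lined up with its own tracking number is to split the raw HTML into one chunk per record and search each chunk separately. The following is only a sketch under stated assumptions: every record starts with a line containing just "Items", the sample string mirrors the question's HTML, and the names PO_RE, TRACK_RE and pair_records are hypothetical. It works on the raw HTML string rather than on Nokogiri nodes.

```ruby
# Sample input mirroring the question's HTML (assumption: every record
# starts with a line that contains only "Items").
HTML = <<~DOC
  ***NOTE***
  <br>
  Items
  <br>
  Invoice Number : [78945824] PO Number : [4587958]
  <br>
  Track It : <a href="abc.com"> 12345</a>
  <br>
  Items
  <br>
  Invoice Number : [79546828] PO Number : [4567892]
  <br>
  Items
  <br>
  Invoice Number : [78976824] PO Number : [897569]
  <br>
  Track It : <a href="abc.com"> 12345</a>
  <br>
DOC

# Hypothetical helper patterns: one capture group each.
PO_RE    = /PO Number : \[(\d+)\]/
TRACK_RE = /Track It :\s*<a href="[^"]*">\s*(\d+)\s*<\/a>/

# Split into per-record chunks, drop the text before the first "Items",
# then pull one PO number and one (possibly missing) tracking number
# out of each chunk. String#[] with a regex returns nil when nothing matches.
def pair_records(html)
  html.split(/^Items$/).drop(1).map do |block|
    [block[PO_RE, 1], block[TRACK_RE, 1]]
  end
end

PAIRS = pair_records(HTML)
p PAIRS
# prints [["4587958", "12345"], ["4567892", nil], ["897569", "12345"]]
```

Because each chunk is searched on its own, a record without a "Track It" line yields nil in its own row instead of shifting the later tracking numbers up by one.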
Try this:
data = page.css("body").text
data = data.gsub(" ", "").split(/\n/)
po = []
track = []
data.each do |i|
  if i.include? "PONumber"
    po << i.split("PONumber:").last.scan(/\d+/)[0]
  end
  if i.include? "TrackIt"
    track << i.split("TrackIt:").last
  end
end
po.zip(track)
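Collecting PO numbers and tracking numbers into two separate arrays and zipping them afterwards still misaligns the rows when a record has no "Track It" line. A possible variation, sketched here under assumptions, builds the rows in a single pass: each PO line opens a new row with a nil tracking number, and a later "Track It" line fills in the most recent row. BODY_TEXT is an assumed approximation of what page.css("body").text returns for the question's HTML, and rows_from_text is a hypothetical helper name.

```ruby
# Assumed shape of the body text for the question's HTML (blank lines omitted).
BODY_TEXT = <<~TEXT
  ***NOTE***
  Items
  Invoice Number : [78945824] PO Number : [4587958]
  Track It :  12345
  Items
  Invoice Number : [79546828] PO Number : [4567892]
  Items
  Invoice Number : [78976824] PO Number : [897569]
  Track It :  12345
TEXT

# Walk the text line by line and build [po_number, tracking_number] rows.
def rows_from_text(text)
  rows = []
  text.each_line do |line|
    if (po = line[/PO Number : \[(\d+)\]/, 1])
      rows << [po, nil]        # new record; no tracking number seen yet
    elsif (track = line[/Track It :\s*(\d+)/, 1]) && rows.any?
      rows.last[1] = track     # attach to the most recent PO number
    end
  end
  rows
end

ROWS = rows_from_text(BODY_TEXT)
ROWS.each { |po, track| puts "PO Number: #{po}, Tracking Number: #{track.inspect}" }
```

The result has the same shape as po.zip(track), so a header row such as ["PO Number", "Tracking Number"] can be prepended in the same way before writing the rows out.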