Using Nokogiri to list which tags are present in the HTML
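I'm trying to use Nokogiri to count all the attributes on an HTML page. Say I search Google: how can I use Nokogiri to count each HTML tag in that domain's page source?
This is my starting point, which doesn't produce the result I expect: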
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.whosebug.com/"))
@doc = Nokogiri::XML(doc)
@doc.xpath("//*")
Something like this should do what you need:
require 'nokogiri'
require 'open-uri'
require 'awesome_print'
# Create a Nokogiri document
# URI.open is used because Kernel#open no longer fetches URLs as of Ruby 3.0
doc = Nokogiri::HTML(URI.open("http://www.whosebug.com/").read)
# Iterate each node in the result set, and for each tag, increment the appropriate counter on the output hash
ap doc.xpath("//*").map(&:name).each_with_object({}) {|n, r| r[n] = (r[n] || 0) + 1 }
Result:
{
"html" => 1,
"head" => 1,
"title" => 1,
"link" => 5,
"meta" => 7,
"script" => 13,
"body" => 1,
"noscript" => 2,
"div" => 1429,
"h3" => 99,
"a" => 717,
"ul" => 5,
"li" => 89,
"span" => 490,
"form" => 1,
"input" => 1,
"br" => 4,
"b" => 3,
"ol" => 8,
"h1" => 1,
"img" => 9,
"h2" => 1,
"h4" => 1,
"table" => 1,
"tr" => 2,
"th" => 5,
"td" => 7
}
#name is the method on each node that returns its tag name, so we simply reduce the node set into the output hash.
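The counting step can be tried on its own, without fetching a page. The sketch below is only an illustration: `tag_names` is a hypothetical hard-coded array standing in for what `doc.xpath("//*").map(&:name)` would return, and the variable names are made up for this example. It shows the same reduction with a default-zero hash, plus `Enumerable#tally` (available in Ruby 2.7 and later), which should give the same counts in one call.

```ruby
# Hypothetical sample data standing in for doc.xpath("//*").map(&:name)
tag_names = %w[html head title body div div a div a span]

# Same reduction as in the answer, using Hash.new(0) instead of (r[n] || 0)
counts = tag_names.each_with_object(Hash.new(0)) { |name, acc| acc[name] += 1 }

# Enumerable#tally (Ruby 2.7+) builds the same name => count hash directly
tallied = tag_names.tally

# Optional: order the pairs by descending count for easier reading
sorted = counts.sort_by { |_name, count| -count }

puts sorted.map { |name, count| "#{name}: #{count}" }
```

With a real document, replacing the sample array with `doc.xpath("//*").map(&:name)` should be the only change needed.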