How to further process a Nokogiri::XML::Element?
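I wrote a short script in Ruby that uses Nokogiri to extract some data from a web page. The script works fine, but it currently returns multiple nested tags as a single Nokogiri::XML::Element.
The script is as follows: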
require 'rubygems'
require 'nokogiri'
#some dummy content that mimics the structure of the web page
dummy_content = '<div id="div_saadi"><div><div style="padding:10px 0"><span class="t4">content</span>content outside of the span<span class="t2">morecontent</span>morecontent outside of the span</div></div></div>'
page = Nokogiri::HTML(dummy_content)
#grab the second div inside of the div entitled div_saadi
result = page.css('div#div_saadi div')[1]
puts result
puts result.class
The output is as follows:
<div style="padding:10px 0">
<span class="t4">content</span>content outside of the span<span class="t2">morecontent</span>morecontent outside of the span
</div>
Nokogiri::XML::Element
What I would like to do is produce the following output (using something like .each):
content
content outside of the span
morecontent
morecontent outside of the span
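You're getting close, but you aren't understanding what you're getting back.
Depending on the HTML, a tag can contain other embedded tags. That's what is happening here: you're asking for a single node, but that node contains additional nodes: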
puts page.css('div#div_saadi div')[1].to_html
# >> <div style="padding:10px 0">
# >> <span class="t4">content</span>content outside of the span<span class="t2">morecontent</span>morecontent outside of the span</div>
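text works on both a NodeSet and a Node. It simply grabs all the text below whatever you point it at and returns it, without caring how many levels it has to descend to do that: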
result = page.css('div#div_saadi div')[1].text
# => "contentcontent outside of the spanmorecontentmorecontent outside of the span"
Instead, you have to walk through the individual embedded nodes and extract their text:
require 'nokogiri'
dummy_content = '<div id="div_saadi"><div><div style="padding:10px 0"><span class="t4">content</span>content outside of the span<span class="t2">morecontent</span>morecontent outside of the span</div></div></div>'
page = Nokogiri::HTML(dummy_content)
result = page.css('div#div_saadi div')[1]
puts result.children.map(&:text)
# >> content
# >> content outside of the span
# >> morecontent
# >> morecontent outside of the span
children returns all the embedded nodes as a NodeSet. Iterating over it yields each node in turn, and calling text on that particular node at that point returns what you want.