How to extract text after <br> using Mechanize

Question

I want to extract the text after the first <br> (the state).

The HTML is:

<div class="location">
    Country
    <br>
    State
    <br>
    City
</div>

Currently I can extract all of the <div> text:

a = Mechanize.new
page = a.get(url)
state = page.at('.location').text
puts state

Any ideas?

Answer 1

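It's easy, but you have to understand how Nokogiri represents the document in the DOM.

There are tags, which are Element nodes, and the text between them, which are Text nodes: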

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div class="location">
    Country
    <br>
    State
    <br>
    City
</div>
EOT

doc.at('.location br').next_sibling.text.strip # => "State"

This is what Nokogiri says the <br> is:

doc.at('.location br').class # => Nokogiri::XML::Element

and the text node that follows it:

doc.at('.location br').next_sibling.class # => Nokogiri::XML::Text

and how we access the text node's content:

doc.at('.location br').next_sibling.text # => "\n    State\n    "

Looking again, this time at the <div> tag and its next sibling:

doc.at('.location').class # => Nokogiri::XML::Element
doc.at('.location').next_sibling.class # => Nokogiri::XML::Text
doc.at('.location').next_sibling # => #<Nokogiri::XML::Text:0x3fcf58489c7c "\n">

By the way, you can get at Mechanize's Nokogiri parser and work with the DOM directly:

require 'mechanize'

agent = Mechanize.new  
page = agent.get('http://example.com')
doc = page.parser

doc.class # => Nokogiri::HTML::Document
doc.title # => "Example Domain"

I can't do like this doc.at('.location br br').next_sibling.text or doc.at('.location br').next_sibling.next_sibling.text

The first assertion is correct: you can't use '.location br br', because you can't nest a tag inside a <br>, so br br is a nonsensical selector for HTML.

The second assertion is wrong. You can use next_sibling.next_sibling, but you have to be aware of the tags in the DOM. With your sample HTML it doesn't return anything meaningful:

doc.at('.location br').to_html # => "<br>"
doc.at('.location br').next_sibling.to_html # => "\n    State\n    "
doc.at('.location br').next_sibling.next_sibling.to_html # => "<br>"

And getting the text of a <br> returns an empty string, because a <br> can't wrap text:

doc.at('br').text # => ""

So, you just didn't go far enough:

doc.at('.location br').next_sibling.next_sibling.next_sibling.text.strip # => "City"

But if that's the point of walking the DOM, I'd do it more simply:

break_text = doc.search('.location br').map{ |br| br.next_sibling.text.strip }
# => ["State", "City"]

Answer 2

Try the following:

a = Mechanize.new
page = a.get(url)
state = page.search(".kiwii-no-link-color").children[2].text
puts state

ruby

mechanize