Extract text retaining links using Nokogiri

Question

How can I extract the text from a <p> while retaining the <a> tags?

<p>
  Some <a href="http://somewhere.com">link</a> going somewhere.
  <ul>
    <li><a href="http://lowendbox.com/">Low end</a></li>
  </ul>
  Some trailing text.
</p>

Expected output:

Some <a href="http://somewhere.com">link</a> going somewhere.
<a href="http://lowendbox.com/">Low end</a>
Some trailing text.

The only solution I can think of is overriding Nokogiri's text method and recursing through the children. I'm hoping there is a simpler solution.

Answer 1

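You can't have a ul inside a p like that, so parsing it as HTML4 or HTML5 won't give you the tree you expect: the parser will typically close the p as soon as it reaches the ul, leaving the list and the trailing text outside the paragraph. That leaves regular expressions, which solve this case easily: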

str = <<EOF
<p>
  Some <a href="http://somewhere.com">link</a> going somewhere.
  <ul>
    <li><a href="http://lowendbox.com/">Low end</a></li>
  </ul>
  Some trailing text.
</p>
EOF
# Remove only the bare <p>, <ul> and <li> tags (opening and closing); <a> tags are untouched.
puts str.gsub(/<\/?(p|ul|li)>/, '')

#  Some <a href="http://somewhere.com">link</a> going somewhere.
#
#    <a href="http://lowendbox.com/">Low end</a>
#
#  Some trailing text.
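
The output above still carries the blank lines and indentation left behind by the removed tags. As a rough sketch in the same regex-only spirit, the pattern can be generalized to strip every tag except <a> and the whitespace tidied up to match the expected output in the question. Note that text_with_links is a hypothetical helper name, not part of Nokogiri, and that this assumes simple markup like the sample (no ">" inside attribute values, no comments or scripts):

```ruby
# Hypothetical helper (not a Nokogiri method): keeps <a> tags, drops all others.
# Assumes simple markup with no ">" inside attribute values.
def text_with_links(html)
  html
    .gsub(%r{</?(?!a\b)[a-zA-Z][^>]*>}, '') # strip any tag whose name is not exactly "a"
    .lines
    .map(&:strip)      # drop indentation left over from the markup
    .reject(&:empty?)  # drop lines that only held removed tags
    .join("\n")
end

str = <<~EOF
  <p>
    Some <a href="http://somewhere.com">link</a> going somewhere.
    <ul>
      <li><a href="http://lowendbox.com/">Low end</a></li>
    </ul>
    Some trailing text.
  </p>
EOF

puts text_with_links(str)
```

The negative lookahead (?!a\b) is what protects the links: it rejects a tag name that is exactly "a" while still matching names that merely start with "a", such as address.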

ruby

mechanize

nokogiri

web-scraping