Invalid bell character inside XML tag using Nokogiri

Question

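I am using Nokogiri::XML::SAX::Document to parse an XML file populated with shopping items.

Some of the items have a paragraph that contains a bell character and is not wrapped in a CDATA section: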

<description>Amazing product that will blow your mind. ^G Caution: may cause skin irritation and death.</description>

* ^G is how this character is displayed in Vim.

Parsing of that element fails with the following error:

XML document contains errors, check this: PCDATA invalid Char value 7.

Is there a way to make Nokogiri ignore invalid characters so that the element shown above can be read?

Answer 1

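That is not an invalid character; a colon (:) is perfectly valid in a text node. The problem must lie elsewhere, probably in invalid XML somewhere else in the document that confuses libxml2 as it parses.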

require 'nokogiri'

doc = Nokogiri::XML::DocumentFragment.parse('<description>Amazing product that will blow your mind. Caution: may cause skin irritation and death.</description>')
doc.to_xml # => "<description>Amazing product that will blow your mind. Caution: may cause skin irritation and death.</description>"
doc.errors # => []

doc.at('description').text # => "Amazing product that will blow your mind. Caution: may cause skin irritation and death."

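To check whether your document parsed cleanly, call the errors method, which makes Nokogiri return an array of errors. In the code above it returns an empty array, which means there was nothing wrong with the parsed content.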

...I discovered which character really is causing the problem...

<description>Amazing product that will blow your mind. ^G Caution: may cause skin irritation and death.</description>

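You can use tr or delete to remove the unwanted character before parsing. Instead of putting ^G in the search string, use \a: it is the same value and is easier to work with: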

>> "^G".ord  #=> 7 (^G here stands for the literal bell character, not a caret followed by G)
>> "\a".ord  #=> 7

So you can do the following:

require 'nokogiri'

xml = "<description>Amazing product that will blow your mind. \a Caution: may cause skin irritation and death.</description>"
doc = Nokogiri::XML::DocumentFragment.parse(xml.delete("\a"))
doc.to_xml # => "<description>Amazing product that will blow your mind.  Caution: may cause skin irritation and death.</description>"
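If the feed might contain other control characters besides the bell, the same idea can be generalized. The sketch below is one possible approach, not something the original answer proposed: strip_invalid_xml_chars is a hypothetical helper name, and the regex is written from the XML 1.0 Char production. It uses only core Ruby and assumes the input is valid UTF-8 (gsub raises on invalid byte sequences).

```ruby
# Hypothetical helper: returns a copy of text without characters that
# XML 1.0 forbids. The allowed set is the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
def strip_invalid_xml_chars(text)
  text.gsub(/[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/, '')
end

# Made-up sample containing a bell, a tab (allowed), and a NUL (not allowed).
xml = "<description>Blow your mind. \a Caution:\tmay cause\u0000 irritation.</description>"

clean = strip_invalid_xml_chars(xml)
puts clean.inspect
# => "<description>Blow your mind.  Caution:\tmay cause irritation.</description>"
```

The cleaned string can then be handed to whichever Nokogiri entry point you already use, including a SAX parser. For very large files you would probably want to apply the same substitution chunk by chunk rather than on the whole document at once.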
