How to use SAX to get CDATA content

Question

I'm trying to parse a large XML file and get the full inner content of each outer XML tag. Given an element like this:

<string name="key"><![CDATA[Hey I'm a tag with & and other characters]]></string>

I want to get this:

<![CDATA[Hey I'm a tag with & and other characters]]>

However, when I use Nokogiri's SAX XML parser, I only get the text, without the CDATA wrapper and with the characters escaped, like this:

Hey I\'m a tag with &amp; and other characters

Here is my code:

  class IDCollector < Nokogiri::XML::SAX::Document
    def initialize
    end

    def characters string
      puts string # this does not work: the CDATA wrapper is not printed
    end

    def cdata_block string
      puts string
      puts "<![CDATA[" + string + "]]>"
    end
  end

Is there any way to do this with Nokogiri SAX?

Answer 1

After going through the documentation for a while, I think this can only be done by building new CDATA content with Nokogiri's help, like this:

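  # `doc`, `value`, and PATH_TO_REPLACE are not defined in this snippet
  tmp = Nokogiri::XML::Document.new
  value = tmp.create_cdata(value)   # wrap the string in a CDATA node
  r = doc.at_xpath(PATH_TO_REPLACE) # element whose content is replaced
  r.inner_html = value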

Answer 2

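It's not clear what you're trying to do, but this might help clarify things.

<![CDATA[...]]> entries aren't tags, they're blocks, and the parser treats them differently. When a block is encountered, the <![CDATA[ and ]]> are stripped, so you only see the string inside. See "What does <![CDATA[]]> in XML mean?" for more information.

If you're trying to create a CDATA block in XML, it can be done easily using: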

doc = Nokogiri::XML(%(<string name="key"></string>))
doc.at('string') << Nokogiri::XML::CDATA.new(Nokogiri::XML::Document.new, "Hey I'm a tag with & and other characters")
doc.to_xml # => "<?xml version=\"1.0\"?>\n<string name=\"key\"><![CDATA[Hey I'm a tag with & and other characters]]></string>\n"

<< is just shorthand for adding a child node.

Trying to use inner_html doesn't do what you want, because it creates a text node as the child:

doc = Nokogiri::XML(%(<string name="key"></string>))
doc.at('string').inner_html = "Hey I'm a tag with & and other characters"
doc.to_xml # => "<?xml version=\"1.0\"?>\n<string name=\"key\">Hey I'm a tag with &amp; and other characters</string>\n"
doc.at('string').children.first.text # => "Hey I'm a tag with & and other characters"
doc.at('string').children.first.class # => Nokogiri::XML::Text

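Using inner_html causes the string to be HTML-encoded, which is the alternative way to embed text that might contain tags. Without encoding or CDATA, XML parsers can get confused about what is text and what is a real tag. I've written RSS aggregators, and having to deal with incorrectly encoded HTML embedded in a feed is a pain.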
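
To make the two alternatives concrete, here is a rough plain-Ruby sketch of what each one does to a string. The helpers escape_text and wrap_cdata are hypothetical, written only for illustration; in real code Nokogiri does this work for you.

```ruby
# Hypothetical helpers for illustration; not part of Nokogiri.

# Entity-encode the characters that XML treats as markup in text content.
# The ampersand is replaced first so later entities are not double-encoded.
def escape_text(text)
  text.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;')
end

# Wrap text in a CDATA section. A literal "]]>" would end the section early,
# so it is split across two adjacent sections.
def wrap_cdata(text)
  "<![CDATA[#{text.gsub(']]>', ']]]]><![CDATA[>')}]]>"
end

text = "Hey I'm a tag with & and <b>other</b> characters"
puts escape_text(text)
puts wrap_cdata(text)
```

Either form parses back to the same text. The difference is only in how the serialized XML looks, which is why a parser hands you the plain string in both cases.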

ruby

xml

sax

nokogiri