Nokogiri gem won't parse the file using a SAX handler

I have an XML file with the header

<?xml version="1.0" encoding="utf-16"?>

and it also contains

<transmission xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">

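When I use the SAX parser, the file does not parse. When I manually remove the encoding part of the header and the attributes after transmission, the XML parses successfully. Because the file is huge, I can only use SAX. Is there a way to parse this XML file without manually removing the encoding and the transmission attributes?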

The sample code is

require 'nokogiri'
include Nokogiri

class P < Nokogiri::XML::SAX::Document
  def initialize
  end

  def start_element(element, attributes = [])
    puts element
  end

  def cdata_block(string)
  end

  def characters(string)
  end

  def end_element(element)
    puts element
  end
end

parser = Nokogiri::XML::SAX::Parser.new(P.new)
parser.parse_file('file_dummy.xml')

Try implementing the full suite of SAX methods and see what you get:

require 'nokogiri'

class MyDoc < Nokogiri::XML::SAX::Document
  def cdata_block(str)
    puts "cdata_block: #{str}"
  end

  def characters(str)
    puts "characters: #{str}"
  end

  def comment(str)
    puts "comment: #{str}"
  end

  def end_element(str)
    puts "end_element: #{str}"
  end

  def end_document
    puts "end_document"
  end

  def end_element_namespace(name, prefix = nil, uri = nil)
    puts "end_element_namespace: name: #{name} prefix: #{prefix} uri: #{uri}"
  end

  def error(str)
    puts "error:#{str}"
  end

  def processing_instruction(name, content)
    puts "processing_instruction: name: #{name} content: #{content}"
  end

  def start_document
    puts "start_document"
  end

  def start_element(str, attrs = [])
    puts "start_element: #{str} attrs: #{attrs}"
  end

  def start_element_namespace(name, attrs=[], prefix=nil, uri=nil, ns=[])
    puts "start_element_namespace: name: #{name} attrs: #{attrs} prefix: #{prefix} uri: #{uri} ns: #{ns}"
  end

  def warning(str)
    puts "warning: #{str}"
  end

  def xmldecl(version, encoding, standalone)
    puts "xmldecl: version: #{version} encoding: #{encoding} standalone: #{standalone}"
  end
end

parser = Nokogiri::XML::SAX::Parser.new(MyDoc.new)
parser.parse(File.open(ARGV[0]))

Save that to a script and run it using:

ruby path/to/script.rb path/to/file.xml

You should see output. For example, using the following as a simple XML file:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
</catalog>

I get the following output:

xmldecl: version: 1.0 encoding:  standalone:
start_document
start_element_namespace: name: catalog attrs: [] prefix:  uri:  ns: []
characters:

start_element_namespace: name: book attrs: [#<struct Nokogiri::XML::SAX::Parser::Attribute localname="id", prefix=nil, uri=nil, value="bk101">] prefix:  uri:  ns: []
characters:

start_element_namespace: name: author attrs: [] prefix:  uri:  ns: []
characters: Gambardella, Matthew
end_element_namespace: name: author prefix:  uri:
characters:

start_element_namespace: name: title attrs: [] prefix:  uri:  ns: []
characters: XML Developer's Guide
end_element_namespace: name: title prefix:  uri:
characters:

start_element_namespace: name: genre attrs: [] prefix:  uri:  ns: []
characters: Computer
end_element_namespace: name: genre prefix:  uri:
characters:

start_element_namespace: name: price attrs: [] prefix:  uri:  ns: []
characters: 44.95
end_element_namespace: name: price prefix:  uri:
characters:

start_element_namespace: name: publish_date attrs: [] prefix:  uri:  ns: []
characters: 2000-10-01
end_element_namespace: name: publish_date prefix:  uri:
characters:

start_element_namespace: name: description attrs: [] prefix:  uri:  ns: []
characters: An in-depth look at creating applications
      with XML.
end_element_namespace: name: description prefix:  uri:
characters:

end_element_namespace: name: book prefix:  uri:
characters:
end_element_namespace: name: catalog prefix:  uri:
end_document
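
If the error callback above reports an encoding problem, one thing worth checking is whether the file's bytes really are UTF-16, or whether only the declaration says so. The following is a minimal sketch using only Ruby's standard library; detect_bom is a hypothetical helper name, not part of Nokogiri, and it only looks for a byte-order mark at the start of the file:

```ruby
# Hypothetical helper (not part of Nokogiri): report which Unicode
# byte-order mark, if any, the file starts with.
# Returns :utf8, :utf16le, :utf16be, or nil when no BOM is present.
def detect_bom(path)
  # Read only the first three bytes in binary mode, so a huge file is fine.
  head = File.binread(path, 3).to_s

  if head.start_with?("\xEF\xBB\xBF".b)
    :utf8
  elsif head.start_with?("\xFF\xFE".b)
    :utf16le
  elsif head.start_with?("\xFE\xFF".b)
    :utf16be
  end
end

# Usage (path is whatever file you are trying to parse):
#   p detect_bom('file_dummy.xml')
```

A nil result does not prove the file is not UTF-16, since a BOM is optional, but a file that declares utf-16 and opens with readable ASCII text in an ordinary editor is probably not UTF-16 on disk.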

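After numerous referrals, I got the answer. @thetinman's answer did not fully solve it for me. What worked was using a sed command to replace utf-16 with utf-8 in the header and then parsing the file. Why do I need the sed step? Is Nokogiri causing this utf-16 problem?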
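
The sed workaround can also be done in Ruby without loading the whole file into memory. The sketch below is an illustration under assumptions, not a confirmed fix: rewrite_declared_encoding is a hypothetical helper name, and it assumes the file's bytes are ASCII-compatible (for example UTF-8 that is merely labelled utf-16), which the fact that the sed replacement worked would suggest. It rewrites only the encoding named in the XML declaration and copies everything else unchanged:

```ruby
# Hypothetical helper: copy src to dst, changing only the encoding named
# in the XML declaration. Assumes the declaration sits within the first
# 256 bytes and that the file's bytes are ASCII-compatible.
def rewrite_declared_encoding(src, dst, from: 'utf-16', to: 'utf-8')
  pattern = /encoding=(["'])#{Regexp.escape(from)}\1/i

  File.open(src, 'rb') do |input|
    File.open(dst, 'wb') do |output|
      # Only the head of the file is inspected and possibly rewritten.
      head = input.read(256).to_s
      output.write(head.sub(pattern) { "encoding=#{$1}#{to}#{$1}" })

      # Stream the remainder in fixed-size chunks so a huge file is fine.
      while (chunk = input.read(64 * 1024))
        output.write(chunk)
      end
    end
  end
end

# Usage (file names are placeholders):
#   rewrite_declared_encoding('file_dummy.xml', 'file_dummy_utf8.xml')
#   parser.parse_file('file_dummy_utf8.xml')
```

If the file really were UTF-16 on disk, the pattern would not match, because each character would be stored as two bytes, and the copy would come out identical to the original.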