Nokogiri gem won't parse the file using a SAX handler
I have an XML file with the header
<?xml version="1.0" encoding="utf-16"?>
and it also contains
<transmission xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
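When I use the SAX parser, the file does not parse. But when I manually remove the encoding part and the attributes after transmission, the XML parses successfully. Since the file is very large, I can only use SAX. Is there a way to parse this XML file without manually removing the encoding and the transmission attributes?
The sample code is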
require 'nokogiri'

# Minimal SAX handler that prints element names as they open and close.
class P < Nokogiri::XML::SAX::Document
  def start_element(element, attributes = [])
    puts element
  end

  def cdata_block(string)
  end

  def characters(string)
  end

  def end_element(element)
    puts element
  end
end

parser = Nokogiri::XML::SAX::Parser.new(P.new)
parser.parse_file('file_dummy.xml')
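One thing worth checking with a file like this is whether its bytes actually match the encoding named in the declaration. The thread does not confirm that a mismatch is the cause here, so treat this as a diagnostic sketch only. It looks at the byte order mark at the start of the file; the helper name `sniff_bom` is made up for illustration and uses only the Ruby standard library.

```ruby
# Hypothetical helper: report the encoding implied by a file's byte order
# mark (BOM), or nil when the file has no BOM.
def sniff_bom(path)
  head = File.binread(path, 4) || ''.b
  bytes = head.bytes
  # Check the 4-byte UTF-32 marks first, because the UTF-32LE BOM begins
  # with the same two bytes as the UTF-16LE BOM.
  return 'UTF-32LE' if bytes == [0xFF, 0xFE, 0x00, 0x00]
  return 'UTF-32BE' if bytes == [0x00, 0x00, 0xFE, 0xFF]
  return 'UTF-8'    if bytes[0, 3] == [0xEF, 0xBB, 0xBF]
  return 'UTF-16LE' if bytes[0, 2] == [0xFF, 0xFE]
  return 'UTF-16BE' if bytes[0, 2] == [0xFE, 0xFF]
  nil # no BOM: the bytes are probably UTF-8 or another 8-bit encoding
end
```

If this returns nil or 'UTF-8' for a file whose declaration says utf-16, the declaration and the actual bytes disagree.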
Try implementing the full suite of SAX methods and see what you get:
require 'nokogiri'

class MyDoc < Nokogiri::XML::SAX::Document
  def cdata_block(str)
    puts "cdata_block: #{str}"
  end

  def characters(str)
    puts "characters: #{str}"
  end

  def comment(str)
    puts "comment: #{str}"
  end

  def end_element(str)
    puts "end_element: #{str}"
  end

  def end_document
    puts "end_document"
  end

  def end_element_namespace(name, prefix = nil, uri = nil)
    puts "end_element_namespace: name: #{name} prefix: #{prefix} uri: #{uri}"
  end

  def error(str)
    puts "error:#{str}"
  end

  def processing_instruction(name, content)
    puts "processing_instruction: name: #{name} content: #{content}"
  end

  def start_document
    puts "start_document"
  end

  def start_element(str, attrs = [])
    puts "start_element: #{str} attrs: #{attrs}"
  end

  def start_element_namespace(name, attrs = [], prefix = nil, uri = nil, ns = [])
    puts "start_element_namespace: name: #{name} attrs: #{attrs} prefix: #{prefix} uri: #{uri} ns: #{ns}"
  end

  def warning(str)
    puts "warning: #{str}"
  end

  def xmldecl(version, encoding, standalone)
    puts "xmldecl: version: #{version} encoding: #{encoding} standalone: #{standalone}"
  end
end

parser = Nokogiri::XML::SAX::Parser.new(MyDoc.new)
parser.parse(File.open(ARGV[0]))
Save it to a script and run it with:
ruby path/to/script.rb path/to/file.xml
You should see output. For example, using the following as a simple XML file:
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
</catalog>
I get the following output:
xmldecl: version: 1.0 encoding: standalone:
start_document
start_element_namespace: name: catalog attrs: [] prefix: uri: ns: []
characters:
start_element_namespace: name: book attrs: [#<struct Nokogiri::XML::SAX::Parser::Attribute localname="id", prefix=nil, uri=nil, value="bk101">] prefix: uri: ns: []
characters:
start_element_namespace: name: author attrs: [] prefix: uri: ns: []
characters: Gambardella, Matthew
end_element_namespace: name: author prefix: uri:
characters:
start_element_namespace: name: title attrs: [] prefix: uri: ns: []
characters: XML Developer's Guide
end_element_namespace: name: title prefix: uri:
characters:
start_element_namespace: name: genre attrs: [] prefix: uri: ns: []
characters: Computer
end_element_namespace: name: genre prefix: uri:
characters:
start_element_namespace: name: price attrs: [] prefix: uri: ns: []
characters: 44.95
end_element_namespace: name: price prefix: uri:
characters:
start_element_namespace: name: publish_date attrs: [] prefix: uri: ns: []
characters: 2000-10-01
end_element_namespace: name: publish_date prefix: uri:
characters:
start_element_namespace: name: description attrs: [] prefix: uri: ns: []
characters: An in-depth look at creating applications
with XML.
end_element_namespace: name: description prefix: uri:
characters:
end_element_namespace: name: book prefix: uri:
characters:
end_element_namespace: name: catalog prefix: uri:
end_document
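After following a lot of references, I got the answer. It builds on @thetinman's answer, but that answer did not completely cover it. What worked was using a sed command to replace utf-16 with utf-8 in the declaration and then parsing the file. Why do I need the sed step? Is Nokogiri causing this utf-16 problem?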
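For anyone who would rather do the sed step from Ruby, here is a minimal sketch using only the standard library. It assumes the file's bytes are really UTF-8 (or plain ASCII) even though the declaration says utf-16, and that the declaration sits within the first 4096 bytes of the first line. The helper name `rewrite_declared_encoding` and its `from:`/`to:` parameters are made up for illustration. The file is copied in chunks, so a large file is never loaded into memory at once.

```ruby
# Hypothetical helper: copy src to dst, changing only the encoding named in
# the XML declaration. The rest of the file is copied byte for byte.
def rewrite_declared_encoding(src, dst, from: 'utf-16', to: 'utf-8')
  pattern = /encoding=(["'])#{Regexp.escape(from)}\1/i
  File.open(src, 'rb') do |input|
    File.open(dst, 'wb') do |output|
      # Read at most 4096 bytes of the first line, where the declaration lives.
      first = input.gets("\n", 4096) || ''.b
      fixed = first.sub(pattern) do
        quote = Regexp.last_match(1)
        "encoding=#{quote}#{to}#{quote}"
      end
      output.write(fixed)
      # Stream the remainder in 64 KiB chunks.
      while (chunk = input.read(64 * 1024))
        output.write(chunk)
      end
    end
  end
  dst
end
```

The rewritten file could then be handed to `parser.parse_file` as in the question. If the bytes are genuinely UTF-16, changing the declaration like this would make things worse rather than better, so it is worth checking the actual encoding first.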