How to decode a UTF-8 string in Ruby?

Question

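I'm parsing an XML file in Ruby (file.rb), but the accented characters still come out wrong even if I force the string's encoding to UTF-8 or "ISO-8859-1". Any clues, or is there a way to set the encoding?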

require 'test/unit'
require 'nokogiri'

class MyTest < Test::Unit::TestCase

  def test_sentence
    doc = Nokogiri::Slop <<-EOXML
<?xml version='1.0' encoding='utf-8'?>
<codeBook version="1.2.2" ID="klm-456-30">
     <var ID="V604" name="FHP_V145" wgt-var="K2" files="F1" dcml="0"
        intrvl="discrete">
          <qstn>
            <qstnLit>Dans quelle mesure cette aide vous a-t-elle
            &#195;&#169;t&#195;&#169; utile? &#195;&#8240;tait-elle
            :</qstnLit>
          </qstn>
    </var>
    <qstn>
</codeBook>
EOXML
    sentence = doc.children.css("[name=FHP_V145]").children.search("qstnLit").first.text.force_encoding("UTF-8").split("\n")
    sentence = sentence.map {|n| n.split.join(" ") }
    sentence = sentence.join(" ")
    puts sentence
    assert_equal(sentence, "Dans quelle mesure cette aide vous a-t-elle été utile? Était-elle :")
  end
end

Answer 1

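The XML appears to be corrupted: each accented character is written as one entity per UTF-8 byte (&#195;&#169; for é) instead of a single entity for its code point (&#233;). The character entities should be specified as follows.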

require 'test/unit'
require 'nokogiri'

class MyTest < Test::Unit::TestCase

  def test_sentence
    doc = Nokogiri::Slop <<-EOXML
<?xml version='1.0' encoding='utf-8'?>
<codeBook version="1.2.2" ID="klm-456-30">
     <var ID="V604" name="FHP_V145" wgt-var="K2" files="F1" dcml="0"
       intrvl="discrete">
         <qstn>
           <qstnLit>Dans quelle mesure cette aide vous a-t-elle
           &#233;t&#233; utile? &#201;tait-elle
           :</qstnLit>
         </qstn>
     </var>
     <qstn>
</codeBook>
EOXML
    sentence = doc.children.css("[name=FHP_V145]").children.search("qstnLit").first.text.force_encoding("UTF-8").split("\n")
    sentence = sentence.map {|n| n.split.join(" ") }
    sentence = sentence.join(" ")
    puts sentence
    assert_equal(sentence, "Dans quelle mesure cette aide vous a-t-elle été utile? Était-elle :")
  end
end
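
To see why the entity numbers differ, it can help to compare a character's code point with its UTF-8 bytes. The following is a minimal sketch in plain Ruby; the helper names code_points and utf8_bytes are made up for this illustration and are not part of Nokogiri or of the code above.

```ruby
# Hypothetical helpers, for illustration only.

# Unicode code points of a string: the number a numeric character
# reference such as &#233; is supposed to carry.
def code_points(str)
  str.encode("UTF-8").codepoints
end

# Raw UTF-8 bytes of the same string: what the corrupted XML
# wrote out as one entity per byte.
def utf8_bytes(str)
  str.encode("UTF-8").bytes
end

e_acute = "\u00E9" # é
p code_points(e_acute) # [233]
p utf8_bytes(e_acute)  # [195, 169]
```

The same comparison for É ("\u00C9") gives code point 201 and bytes 195, 137, which matches the &#201; used in the corrected XML.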

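If you cannot correct the XML, you can replace these entities with the actual bytes before parsing, as shown below. Note, however, that &#8240; is incorrect; it should be &#137;. Byte 137 (0x89) is the second UTF-8 byte of É, whereas 8240 is the code point of ‰, the character that Windows-1252 assigns to byte 0x89.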

require 'test/unit'
require 'nokogiri'

class MyTest < Test::Unit::TestCase

  def test_sentence
    # Replace each numeric entity with the single byte it stands for.
    doc = Nokogiri::Slop <<-EOXML.gsub(/\&#([^;]+);/){[$1.to_i].pack('C')}
<?xml version='1.0' encoding='utf-8'?>
<codeBook version="1.2.2" ID="klm-456-30">
     <var ID="V604" name="FHP_V145" wgt-var="K2" files="F1" dcml="0"
        intrvl="discrete">
          <qstn>
            <qstnLit>Dans quelle mesure cette aide vous a-t-elle
            &#195;&#169;t&#195;&#169; utile? &#195;&#137;tait-elle
            :</qstnLit>
          </qstn>
    </var>
    <qstn>
</codeBook>
EOXML

    # Nokogiri returns UTF-8 text; labelling it ASCII-8BIT would make the comparison below fail.
    sentence = doc.children.css("[name=FHP_V145]").children.search("qstnLit").first.text.force_encoding("UTF-8").split("\n")
    sentence = sentence.map {|n| n.split.join(" ") }
    sentence = sentence.join(" ")
    puts sentence
    assert_equal(sentence, "Dans quelle mesure cette aide vous a-t-elle été utile? Était-elle :")
  end
end
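
If the input cannot be edited and contains references such as &#8240; that are not plain byte values, one alternative is to undo the double encoding rather than pack bytes. This is only a sketch, under the assumption that the text was UTF-8 that got misread as Windows-1252 before the entities were written. The names decode_numeric_entities and repair_double_encoding are hypothetical, and only decimal references are handled.

```ruby
# Hypothetical helpers; assumes UTF-8 text that was misread as Windows-1252.

# Turn decimal references such as &#195; into the characters they name.
def decode_numeric_entities(text)
  text.gsub(/&#(\d+);/) { [$1.to_i].pack("U") }
end

# Map every character back to its Windows-1252 byte, then relabel
# the resulting byte sequence as UTF-8.
def repair_double_encoding(text)
  text.encode("Windows-1252").force_encoding("UTF-8")
end

garbled = "&#195;&#169;t&#195;&#169; utile? &#195;&#8240;tait-elle"
puts repair_double_encoding(decode_numeric_entities(garbled))
# été utile? Était-elle
```

Nokogiri resolves numeric references itself while parsing, so on a string obtained from text only the second step should be needed. Characters with no Windows-1252 mapping make encode raise Encoding::UndefinedConversionError, so this is best applied only to text known to be damaged in this way.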

Answer 2

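Nokogiri does its best with encodings. Because it encounters &#8240;, which is the per mille sign (‰), it makes sure the input text is treated as UTF-8, which is Nokogiri's default encoding in any case.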
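
Since the text Nokogiri returns is already UTF-8, calling force_encoding("UTF-8") on it changes nothing, and forcing a different encoding only relabels the same bytes. The sketch below contrasts relabeling with transcoding, using core String methods only; the helper name describe is invented for this example.

```ruby
# Hypothetical helper: summarize how Ruby currently sees a string.
def describe(str)
  { encoding: str.encoding.name, bytes: str.bytes, valid: str.valid_encoding? }
end

utf8 = "\u00E9" # é as UTF-8, bytes 195 169

# force_encoding only changes the label; the bytes stay the same.
relabeled = utf8.dup.force_encoding("ISO-8859-1")

# encode transcodes: the bytes change so that the character stays the same.
transcoded = utf8.encode("ISO-8859-1")

p describe(utf8)       # bytes: [195, 169]
p describe(relabeled)  # bytes: [195, 169], now read as two characters
p describe(transcoded) # bytes: [233]
```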

ruby

character-encoding

nokogiri