Perl XML::Twig character encoding

Question

I have a set of XML files that contain a mix of non-ASCII characters and encoded characters, for example:

... many 8-bit characters such as é, &#10906;, and ñ.

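(The second character is the ampersand-semicolon form, that is, the numeric character reference, of ⪚. The first and third are unescaped characters.)

The files are in UTF-8.

When I run my Perl script with XML::Twig, the entity (the second character above) gets turned into an unknown character, and I get a 'Wide character in print' message when writing the file.

Here is my code. All the handlers do is read the XML; they make no changes: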

 my $twig= XML::Twig->new( 
   comments => 'keep',
   output_encoding => 'UTF-8',
#   keep_encoding => 1,
   twig_handlers => { topicref => \&topicref_processing,
            xref => \&topicref_processing,
            link => \&topicref_processing},
      pretty_print => 'indented',

 );

 $twig->parsefile($file);
 my($outfile) = $file;
 $outfile =~ s/([.]dita)/.out/i;

open(NEW,">$outfile");
$twig->flush( \*NEW);
close(NEW);

If I add keep_encoding => 1 (commented out above), the entity gets preserved, but the first and third characters get corrupted:

...such as Ã©, &#10906;, and Ã±.

If I add UTF-8 encoding to the flush:

open(NEW,'>:encoding(UTF-8)', $outfile);

it gets even weirder:

...such as Ã?Â©, &#10906;, and Ã?Â±.

Any idea how to get both the characters and the entities through unscathed? Thank you, Scott

Answer 1

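You don't need to do anything special apart from making sure that your input and output IO channels are set to UTF-8 encoding. The Wide character in print warning means that you are trying to print a wide character (a code point greater than 255) to a channel that has only byte semantics.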
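The warning can be reproduced without XML::Twig at all. The following is a minimal sketch using only core Perl; the helper name print_warning_for is made up for this illustration, and the in-memory filehandle opened with :raw stands in for an output file opened without an encoding layer.

```perl
use strict;
use warnings;

# Hypothetical helper: print U+2A9A to an in-memory handle opened with
# the given mode and return any warning text that print produced.
sub print_warning_for {
    my ($mode) = @_;
    my $warning = '';
    local $SIG{__WARN__} = sub { $warning .= $_[0] };
    my $buffer = '';
    open my $fh, $mode, \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$fh} "\x{2A9A}";    # the character behind &#10906;
    close $fh;
    return $warning;
}

# Compare a byte-only handle with one that has a UTF-8 encoding layer.
for my $mode ('>:raw', '>:encoding(UTF-8)') {
    my $warned = print_warning_for($mode) =~ /Wide character/;
    print $mode, ': ', ($warned ? 'Wide character warning' : 'no warning'), "\n";
}
```

Only the handle without an encoding layer should produce the warning, which is why fixing the layer on the output handle is enough.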

If I use this data

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <text>... many 8-bit characters such as é, &#10906;, and ñ.</text>
</root>

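then everything works fine with the code below. The key is use open qw/ :std :encoding(utf-8) /, which sets STDIN, STDOUT and STDERR, as well as any other newly opened file handles, to use UTF-8 encoding.

Unfortunately, the keep_encoding option seems to control both entity expansion and output encoding, and I can't see a way to persuade XML::Twig to return a plain character string while it is enabled. All you can get is an encoded byte sequence, on which you have to call decode_utf8 to get the characters back before passing them to an encoded output channel. If anyone knows a better way of handling this, I would be grateful. Of course, it is possible to send the encoded data from the module to a :raw output channel, but that isn't the way things are supposed to work.

Note that, to see the character ⪚ in the output, you must use a font that has a glyph for that code point. Most fonts won't have that character.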

use strict;
use warnings;

use open qw/ :std :encoding(utf-8) /;

use XML::Twig ();
use Encode qw/ decode_utf8 /;

my $twig = XML::Twig->new( keep_encoding => 1 );
$twig->parsefile('utf-8.xml');

my ($text) = $twig->findnodes('/root/text');
$text = decode_utf8($text->trimmed_text);

print $text, "\n";

Output

... many 8-bit characters such as é, &#10906;, and ñ.

Update

This is to explain the output you are getting

If I add keep_encoding => 1 (commented out above), the entity gets preserved, but the first and third characters get corrupted:

...such as Ã©, &#10906;, and Ã±.

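Those characters aren't corrupted. The text is being output as UTF-8, but whatever you are using to view it expects a single-byte encoding such as ISO-8859-1. When encoded as UTF-8, the e-acute character U+00E9 is the two-byte sequence 0xC3 0xA9. When interpreted as ISO-8859-1, 0xC3 is a capital A with a tilde and 0xA9 is the copyright sign, which is exactly what you are seeing. If you view the data with something that expects UTF-8, you will see the single character e-acute.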
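The byte arithmetic above can be checked with the core Encode module. This is only an illustrative sketch: the helper name misread_as_latin1 is invented here, and it simulates a Latin-1 viewer rather than anything XML::Twig itself does.

```perl
use strict;
use warnings;
use Encode qw/ encode decode /;

# Hypothetical helper: encode characters as UTF-8, then misread those
# bytes as ISO-8859-1, the way a Latin-1 viewer would.
sub misread_as_latin1 {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    return decode('ISO-8859-1', $bytes);
}

# e-acute (U+00E9) turns into two characters: A-tilde and copyright.
my $seen = misread_as_latin1("\x{E9}");
printf "U+%04X ", ord($_) for split //, $seen;    # U+00C3 U+00A9
print "\n";
```

The same reasoning applies to n-tilde (U+00F1), whose UTF-8 bytes 0xC3 0xB1 read as A-tilde followed by the plus-minus sign.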

If I add UTF-8 encoding to the flush:

open(NEW,'>:encoding(UTF-8)', $outfile);

it gets even weirder:

...such as Ã?Â©, &#10906;, and Ã?Â±.

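What is happening here is that, although the string from XML::Twig is already encoded as UTF-8, the data isn't marked as such. That means the two bytes that make up the UTF-8-encoded character are treated as separate characters, and each is encoded again, giving four bytes in total, which a Latin-1 viewer shows as four characters.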
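Double encoding can be sketched the same way with the core Encode module. The helper name double_encode_hex is hypothetical; the second encode call plays the role of the :encoding(UTF-8) layer being applied to bytes that were already UTF-8.

```perl
use strict;
use warnings;
use Encode qw/ encode /;

# Hypothetical helper: encode a string as UTF-8, then encode the
# resulting bytes again as if each byte were a character.
sub double_encode_hex {
    my ($chars) = @_;
    my $once  = encode('UTF-8', $chars);    # C3 A9 for e-acute
    my $twice = encode('UTF-8', $once);     # each byte encoded again
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $twice;
}

print double_encode_hex("\x{E9}"), "\n";    # C3 83 C2 A9
```

Read as ISO-8859-1, those four bytes are A-tilde, an unprintable control character (0x83, shown as ?), A-circumflex and the copyright sign, which matches the output quoted above.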

Answer 2

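First things first: in your case keep_encoding should not be used. It is an old option that dates back to ancient times, when latin1 was a commonly used encoding and perl did not play well with unicode. I'm talking about pre-5.8 here. The option gave people living in an all-latin world a way to process XML without having to deal with unicode at all. Using it with utf-8 data leads to madness (and to the encoding problems you found).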

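As the other answer mentions, the output file needs to be opened in utf8 mode, either in the open call or through use utf8::all;. This gets rid of the wide character warning and avoids the worse case where the output is converted to latin1 if it contains only ascii and extended ascii characters (perl does this for backward compatibility; it would happen if you removed ⪚ from your input).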
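The silent latin1 case is easy to miss because no warning is raised. Here is a small sketch in core Perl; written_bytes_hex is a made-up helper, and the :raw in-memory handle is an assumption standing in for an output file opened without any encoding layer.

```perl
use strict;
use warnings;

# Hypothetical helper: print characters to an in-memory handle opened
# with the given mode and return the bytes that were written, as hex.
sub written_bytes_hex {
    my ($mode, $chars) = @_;
    my $buffer = '';
    open my $fh, $mode, \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$fh} $chars;
    close $fh or die "Cannot close in-memory handle: $!";
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $buffer;
}

# e-acute fits in one byte, so without an encoding layer it is written
# as a single Latin-1 byte and no warning is raised.
print written_bytes_hex('>:raw', "\x{E9}"), "\n";               # E9
print written_bytes_hex('>:encoding(UTF-8)', "\x{E9}"), "\n";   # C3 A9
```

A file containing the lone byte 0xE9 is not valid UTF-8, so an XML declaration claiming UTF-8 would no longer match the content.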

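Once you have done this, the output file will be in proper utf-8, unescaped. If it does not display properly, it may be that your terminal does not support utf-8.

If you need to escape all non-ASCII characters, you can use the output_filter => 'safe' option, as in the code below.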

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;
use utf8::all; # either this or open the output file with '>:utf8'

my $file= 'test_enc.dita';

 my $twig= XML::Twig->new( 
   comments => 'keep',
   # escapes all non-ascii characters (including accented ones)
   output_filter => 'safe', 
   twig_handlers => { topicref => \&topicref_processing,
            xref => \&topicref_processing,
            link => \&topicref_processing},
      pretty_print => 'indented',

 );

 $twig->parsefile( $file);
 my($outfile) = $file;
 $outfile =~ s/([.]dita)/.out/i;

# current best practice recommends the 3-argument form of
# open and lexical filehandles
open( my $out, '>', $outfile) or die "Cannot open $outfile: $!";
$twig->flush( $out);
close( $out);

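Apart from keep_encoding, which is a hack, there is no real way to faithfully preserve the encoded/non-encoded form of characters. If you really need to keep the extended ascii characters as characters and encode the others as numeric character entities, you will have to provide a custom function to output_filter; it should receive the string (all utf-8 characters) and return the string to output (with some characters encoded as numeric entities).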
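Such a function might look like the sketch below. The name escape_above_latin1 and the cut-off at U+00FF are assumptions made for illustration, and the wiring shown in the comment follows the description above rather than tested XML::Twig behaviour.

```perl
use strict;
use warnings;

# Hypothetical filter: keep code points up to U+00FF as characters and
# write everything above that as a decimal numeric character reference.
sub escape_above_latin1 {
    my ($string) = @_;
    $string =~ s/([^\x{00}-\x{FF}])/sprintf('&#%d;', ord($1))/ge;
    return $string;
}

# Assumed wiring, based on the description above:
#   my $twig = XML::Twig->new( output_filter => \&escape_above_latin1 );

binmode STDOUT, ':encoding(UTF-8)';
print escape_above_latin1("caf\x{E9} \x{2A9A}"), "\n";    # café &#10906;
```

Note that this only approximates the input in the question: it does not remember how each character was originally written, so an e-acute that was written as a numeric reference in the source would still come out as a plain character.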

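That said, I'm not sure you need to be faithful to the original format. XML processors should not care about it. In fact, this is why it is difficult to preserve the encoding: the code that calls the parser only sees the text as a utf-8 string, with all entities already decoded.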
