HTML entity decoding mixed with BIG5 characters, converting to UTF-8 using Perl

Question

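My Perl script and its input data file are both encoded in BIG5 (Chinese).

The string data contains HTML entities, e.g. for Japanese characters.

When the result is viewed in a browser, it displays perfectly.

But for further data manipulation, I need to convert all of it to UTF-8.

For example,

from BIG5 encoding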

一&#12392;三

to UTF-8 encoding

一と三

Here is the code I tried:

#!/usr/local/bin/perl

use Encode qw/encode decode/;
use HTML::Entities;

print "Content-type: text/html\n\n";

$str = "&#12392;";
$str = encode('utf8', decode("big5",$str));
print "$str\n";
decode_entities($str);
print "$str\n";

$str2 = "一&#12392;三";
$str2 = encode('utf8', decode("big5",$str2));
print "$str2\n";
decode_entities($str2); # where the issue is
print "$str2\n";

This is the result of running the code above:

&#12392;
と
一&#12392;三
ä¸とä¸

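Note that the script itself is also saved in BIG5 encoding.

After decode_entities($str2); it seems to also try to decode the Chinese characters that are already in UTF-8, and that is what causes the problem.

How can I solve this? Or can decode_entities() be restricted to apply only to the &xxxxx; pattern?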

Answer 1

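The problem is that you combine the output of decode_entities, which is a utf8 string (utf8::is_utf8 returns true), with a raw string (utf8::is_utf8 returns false) consisting of a stream of octets that can be interpreted as utf8. Instead, you should combine either raw strings only or utf8 strings only.

The following works by first decoding your string from big5 into a utf8 string, then decoding the HTML entities, and finally encoding everything into a raw string of octets representing the utf8 characters: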
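The effect of mixing the two kinds of string can be reproduced with core Perl alone. The sketch below is an illustration, not the internals of decode_entities: it assumes that inserting a wide character into a byte string behaves like plain concatenation, and the variable names are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+4E00 is the character 一. Building it from a code point keeps this
# example independent of the encoding the script file is saved in.
my $chars = "\x{4E00}";               # character string: 1 character
my $bytes = encode('UTF-8', $chars);  # byte string: 3 octets (E4 B8 80)

# Appending a wide character (U+3068, と) to the byte string makes Perl
# upgrade the octets as if each one were a Latin-1 character.
my $mixed = $bytes . "\x{3068}";      # 3 Latin-1 characters + 1 wide character
my $clean = $chars . "\x{3068}";      # 2 real characters

printf "mixed: %d characters, first is U+%04X\n", length($mixed), ord($mixed);
printf "clean: %d characters, first is U+%04X\n", length($clean), ord($clean);
```

In the mixed string the three octets of 一 have become three separate characters (U+00E4, U+00B8, U+0080), which matches the "ä¸" seen in the question's output once the string is printed.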

$str2 = "一&#12392;三";
$str2 = decode("big5",$str2);  # big5 to internal utf8 -> utf8::is_utf8($str2) is true
decode_entities($str2);        # decode HTML entities
$str2 = encode('utf8',$str2);  # internal utf8 to raw bytes, utf8::is_utf8($str2) is false
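Wrapped up as a complete script, the three steps might look like the sketch below. The helper name big5_html_to_utf8 is hypothetical, and the Big5 input is generated from code points (an assumption made here so the example does not depend on how the file is saved, unlike the original script, which was saved as BIG5).

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use HTML::Entities qw(decode_entities);

# Hypothetical helper: takes Big5 octets that may contain HTML entities
# and returns UTF-8 octets with the entities resolved.
sub big5_html_to_utf8 {
    my ($big5_octets) = @_;
    my $text = decode('big5', $big5_octets);  # octets -> characters
    decode_entities($text);                   # void context: decodes in place
    return encode('UTF-8', $text);            # characters -> UTF-8 octets
}

# Build the Big5 input from code points: U+4E00 (一), the entity for
# U+3068 (と), and U+4E09 (三).
my $input = encode('big5', "\x{4E00}") . '&#12392;' . encode('big5', "\x{4E09}");
my $utf8_bytes = big5_html_to_utf8($input);

# The result is already a byte string, so it is written without an encoding layer.
binmode STDOUT;
print $utf8_bytes, "\n";
```

The point of the ordering is that decode_entities only ever sees a character string, so the characters it produces and the characters already present are of the same kind; encoding to octets happens once, at the very end.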
