Why do I get garbled output when I decode some HTML entities but not others?

Question

In Perl, I am trying to decode strings that contain numeric HTML entities using HTML::Entities. Some entities decode correctly, while "newer" entities do not. For example:

decode_entities('&#174;');  # returns ® as expected
decode_entities('&#8486;'); # returns â„¦ instead of Ω
decode_entities('&#9733;'); # returns â˜… instead of ★

Is there a way to decode these "newer" HTML entities in Perl? In PHP, the html_entity_decode function seems to decode all of them without any problem.

Answer 1

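The entities are decoded fine. It is the way you output them that is wrong. For example, you may have sent the strings to a terminal without first encoding them for that terminal. In the following program, the open pragma takes care of that encoding step: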

$ perl -e'
    use open ":std", ":encoding(UTF-8)";
    use HTML::Entities qw( decode_entities );
    CORE::say decode_entities($_)
       for "&#174;", "&#8486;", "&#9733;";
'
®
Ω
★
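
To illustrate why only the "newer" entities look broken, here is a rough sketch that simulates the situation. The helper name seen_without_layer is hypothetical, and the sketch assumes the output is viewed as Windows-1252 and that Perl writes strings whose characters all fit in one byte as Latin-1, and wider strings as UTF-8 bytes, when the handle has no encoding layer.

```perl
use strict;
use warnings;
use Encode qw( encode decode );
use HTML::Entities qw( decode_entities );

# Hypothetical helper: approximates what a viewer expecting Windows-1252 shows
# when Perl prints $text to a handle that has no encoding layer.
sub seen_without_layer {
    my ($text) = @_;

    # Assumption: text that fits in one byte per character is written as
    # Latin-1 bytes; anything wider is written as UTF-8 bytes (with a
    # "Wide character" warning when warnings are enabled).
    my $bytes = $text =~ /[^\x00-\xFF]/
        ? encode('UTF-8', $text)
        : encode('ISO-8859-1', $text);

    return decode('cp1252', $bytes);
}

binmode STDOUT, ':encoding(UTF-8)';   # so this script's own output is encoded

for my $entity ('&#174;', '&#8486;', '&#9733;') {
    my $char = decode_entities($entity);
    printf "%s decodes to U+%04X, shown without a layer as: %s\n",
        $entity, ord($char), seen_without_layer($char);
}
```

Under these assumptions, each entity decodes to a single character with the expected code point (U+00AE, U+2126, U+2605). The first survives because it fits in one byte, while the other two come out as the â„¦ and â˜… seen in the question.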

Answer 2

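Make sure your terminal can handle UTF-8 encoding. It looks like the problem is with multibyte characters. You can also try setting UTF-8 on STDOUT, which avoids "Wide character" warnings as well.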

use strict;
use warnings;
use HTML::Entities;

binmode STDOUT, ':encoding(UTF-8)';

print decode_entities('&#174;');  # returns ®
print decode_entities('&#8486;'); # returns Ω
print decode_entities('&#9733;'); # returns ★

This gives me the correct/expected results.
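
As an alternative to an encoding layer on STDOUT, the decoded text can be encoded explicitly before it is printed. This is only a minimal sketch: the helper name entity_to_utf8_bytes is hypothetical, and it assumes the consumer of the output expects UTF-8.

```perl
use strict;
use warnings;
use Encode qw( encode );
use HTML::Entities qw( decode_entities );

# Hypothetical helper: decode the entities, then turn the resulting
# characters into UTF-8 octets that can be written to a raw handle.
sub entity_to_utf8_bytes {
    my ($entity) = @_;
    return encode('UTF-8', decode_entities($entity));
}

binmode STDOUT, ':raw';   # no encoding layer; the octets are already encoded

print entity_to_utf8_bytes($_), "\n"
    for '&#174;', '&#8486;', '&#9733;';
```

Use one approach or the other for a given handle. Printing already encoded octets through an :encoding(UTF-8) layer would encode them a second time.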

perl

decode

html-entities