Perl internal representation of a Unicode string
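I am developing a Perl + Mojolicious web application. My front end sends a POST request using the UTF-8 charset, with accented characters in the "a" parameter ("été"), as I can verify in the Chrome network tab. But the server-side script decodes this parameter using a charset I did not expect.
I wrote the following script to reproduce the case.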
use utf8; # script encoded in UTF-8 without BOM
use Mojolicious::Lite;
use Data::HexDump;
{
    require Mojolicious;
    say "perl $^V, Mojolicious: v", Mojolicious->VERSION, ", ", `chcp` ;
}
post '/' => sub{
    my $self = shift;
    my $params = $self->req->params->to_hash;
    app->log->debug("received data:\n", HexDump( $params->{a} ) );
    use Devel::Peek;
    Dump( $params->{a} );
    $self->render( text => "ok for '$params->{a}'" );
};
if(my $pid = fork()){
    use Mojo::UserAgent;
    my $t = Mojo::UserAgent->new;
    # simulate front-end query
    my $tx = $t->post('http://127.0.0.1:3042/' =>
        { 'Content-Type' => 'application/x-www-form-urlencoded; charset=UTF-8' },
        form => { a => 'été'}
    );
    my $res = $tx->res->body;
    say "result:\n", HexDump($res);
    use Devel::Peek;
    Dump( $res );
    kill 'SIGKILL', $pid;
    exit(0);
}
app->start(qw(daemon --listen http://*:3042 ));
The output of this script is:
perl v5.20.1, Mojolicious: v6.05, Page de codes active : 850
[Tue May 26 12:31:15 2015] [info] Listening at "http://*:3042"
Server available at http://127.0.0.1:3042
[Tue May 26 12:31:16 2015] [debug] Your secret passphrase needs to be changed
[Tue May 26 12:31:16 2015] [debug] POST "/"
[Tue May 26 12:31:16 2015] [debug] Routing to a callback
[Tue May 26 12:31:16 2015] [debug] received data:
00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F 0123456789ABCDEF
00000000 E9 74 E9 .t.
SV = PVMG(0x5a7a198) at 0x4dce730
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x5b62c48 "\303\251t\303\251"\0 [UTF8 "\x{e9}t\x{e9}"]
CUR = 5
LEN = 10
[Tue May 26 12:31:16 2015] [debug] 200 OK (0.005052s, 197.941/s)
result:
00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F 0123456789ABCDEF
00000000 6F 6B 20 66 6F 72 20 27 - C3 A9 74 C3 A9 27 ok for '..t..'
SV = PV(0x41a73e8) at 0x4927070
REFCNT = 1
FLAGS = (PADMY,POK,IsCOW,pPOK)
PV = 0x5aa1328 "ok for '\303\251t\303\251'"\0
CUR = 14
LEN = 16
COW_REFCNT = 1
So we can see that the server receives the "a" parameter in a string flagged as UTF8 whose content is "\x{e9}t\x{e9}".
I was expecting "été" as the hex bytes "C3 A9 74 C3 A9".
What is wrong?
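Update: There is nothing wrong with your program. You got the été you wanted; it is just being dumped as the Perl Unicode string "\xE9t\xE9", which is the same thing. At the language level a Perl Unicode string is not a sequence of UTF-8 bytes: the input is decoded from UTF-8 into Unicode code points (ordinals), and UTF-8 is only one way to encode those code points. (The PV buffer that Devel::Peek prints is Perl's internal storage, which is an implementation detail.)
é is ordinal 233; see the Wikipedia link below (the program is also updated).
Well, été is C3 A9 74 C3 A9 only in UTF-8. As numbers/ordinals été is 233 116 233, and as a Perl Unicode string it is \xE9t\xE9, because 233 is E9 in hexadecimal.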
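The Devel::Peek dump in the question shows a UTF8 flag and a five-byte PV buffer, yet the string holds three characters. Here is a minimal sketch, using only the core utf8:: functions, of how the internal buffer can change while the string value stays the same. The variable names $plain and $upgraded are illustrative and do not come from the question.

```perl
use strict;
use warnings;

# Three characters, all below 256, written with \x escapes.
# Without "use utf8" this literal is stored one byte per character.
my $plain = "\xE9t\xE9";

# Copy the string and force the copy into Perl's UTF-8 based internal form.
my $upgraded = $plain;
utf8::upgrade($upgraded);

# The internal flag differs, but the strings compare equal and both
# have a length of three characters.
printf "plain flagged:    %s\n", utf8::is_utf8($plain)    ? 'yes' : 'no';
printf "upgraded flagged: %s\n", utf8::is_utf8($upgraded) ? 'yes' : 'no';
printf "same string:      %s\n", ($plain eq $upgraded)    ? 'yes' : 'no';
printf "lengths:          %d and %d\n", length($plain), length($upgraded);
```

The flag only says how the buffer is stored. It does not say which bytes would go over the wire, which is why an explicit encoding step is needed before looking at bytes.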
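Update: Previously I created the UTF-8 file 2 with an editor; here it is created with perl. You can see that it contains the bytes you expect, and dd shows the difference between reading it as UTF-8 and reading it raw: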
$ perl -CS -e " print chr(233), chr(116), chr(233) " >2
$ od -tx1 2
0000000 c3 a9 74 c3 a9
0000005
$ type 2
été
$
$ perl -MData::Dump -MPath::Tiny -e " dd ( path(2)->slurp_raw ) "
"\xC3\xA9t\xC3\xA9"
$ perl -MData::Dump -MPath::Tiny -e " dd ( path(2)->slurp_utf8 ) "
"\xE9t\xE9"
$ perl -MData::Dump -MPath::Tiny -e " dd( map { [ $_, ord$_ ] } split //, path(2)->slurp_utf8 ) "
(["\xE9", 233], ["t", 116], ["\xE9", 233])
U+00E9 is the code point of é, and c3 a9 is its UTF-8 encoding. To see the UTF-8 encoded form of 'é', you need to UTF-8 encode it. For example:
#!/usr/bin/env perl -l
use utf8;
use strict;
use warnings;
use Unicode::UTF8 qw( encode_utf8 );
binmode STDOUT, ':encoding(UTF-8)';
my $é = "\x{e9}";
print $é;
printf "%v02x\n", encode_utf8($é);
Output:
$ ./u.pl
é
c3.a9
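The same round trip can be sketched with the core Encode module instead of Unicode::UTF8, printing the bytes in the "C3 A9 74 C3 A9" form the question expected. This is only an illustration under that assumption; the variable names $wire, $chars and $again are hypothetical and not part of the original script.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The five bytes that travel over the wire for "été" in UTF-8.
my $wire = "\xC3\xA9t\xC3\xA9";

# Decoding turns the bytes into a character string of three code points.
my $chars = decode('UTF-8', $wire);
printf "characters: %d\n", length($chars);
printf "ordinals:   %s\n", join ' ', map { ord($_) } split //, $chars;

# Encoding the character string again gives back the original bytes.
my $again = encode('UTF-8', $chars);
printf "bytes:      %d\n", length($again);
printf "hex:        %s\n", join ' ', map { sprintf('%02X', ord($_)) } split //, $again;
```

In the question's handler, the parameter is already at the $chars stage, so a hex dump of it shows E9 74 E9. Dumping an encoded copy instead would show the five bytes.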