Perl: Packing a sequence of bytes into a string

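I'm trying to run a simple test in which I build binary strings in different formats and print them out. In fact, I'm trying to investigate a problem where sprintf cannot handle a wide-character string passed in for the %s placeholder.

In this case, the binary string should contain only the Cyrillic letter "д" (because it lies above ISO-8859-1).

The code below works when I use the character directly in the source.

But nothing that goes through pack works.

Code: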

#!/usr/bin/perl

use utf8; # Meaning "This lexical scope (i.e. file) contains utf8"

# https://perldoc.perl.org/open.html

use open qw(:std :encoding(UTF-8));

sub showme {
   my ($name,$ch) = @_;
   print "-------\n";
   print "This is test: $name\n";

   my $ord = ord($ch); # ordinal computed outside of "use bytes"; actually should yield the unicode codepoint

   {
      # https://perldoc.perl.org/bytes.html
      use bytes;
      my $mark = (utf8::is_utf8($ch) ? "yes" : "no");
      my $txt  = sprintf("Received string of length: %i byte, contents: %vd, ordinal x%04X, utf-8: %s\n", length($ch), $ch, $ord, $mark);
      print $txt,"\n";
   }

   print $ch, "\n";
   print "Combine: $ch\n";
   print "Concat: " . $ch . "\n";
   print "Sprintf: " . sprintf("%s",$ch) . "\n";
   print "-------\n";
}


showme("Cryillic direct" , "д");
showme("Cyrillic UTF-8"  , pack("HH","D0","B4"));  # UTF-8 of д is D0B4
showme("Cyrillic UCS-2"  , pack("HH","04","34"));  # UCS-2 of д is 0434

Current output:

This looks good:

-------
This is test: Cryillic direct
Received string of length: 2 byte, contents: 208.180, ordinal x0434, utf-8: yes

д
Combine: д
Concat: д
Sprintf: д
-------

No good. Where does the 176 come from?

-------
This is test: Cyrillic UTF-8
Received string of length: 2 byte, contents: 208.176, ordinal x00D0, utf-8: no

а
Combine: а
Concat: а
Sprintf: а
-------

This is even worse.

-------
This is test: Cyrillic UCS-2
Received string of length: 2 byte, contents: 0.48, ordinal x0000, utf-8: no

0
Combine: 0
Concat: 0
Sprintf: 0
-------

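You have two problems.


Your calls to pack are incorrect

Each H represents one hex digit, so pack("HH", "D0", "B4") uses only the first digit of each argument and produces the bytes D0 and B0 (208 and 176). Use H2 to consume two digits per argument.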

$ perl -e'printf "%vX\n", pack("HH", "D0", "B4")'       # XXX
D0.B0

$ perl -e'printf "%vX\n", pack("H2H2", "D0", "B4")'     # Ok
D0.B4

$ perl -e'printf "%vX\n", pack("(H2)2", "D0", "B4")'    # Ok
D0.B4

$ perl -e'printf "%vX\n", pack("(H2)*", "D0", "B4")'    # Better
D0.B4

$ perl -e'printf "%vX\n", pack("H*", "D0B4")'           # Alternative
D0.B4
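
The same rule explains both the 176 and the 0.48 in the question's output. Here is a minimal sketch in the decimal %vd notation the question uses; the helper name dotted is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: show each byte of a string as dot-separated decimals,
# like the %vd output in the question.
sub dotted { return sprintf("%vd", $_[0]) }

# "H" alone means "H1": only the first hex digit of each argument is used,
# and it becomes the high nybble of the byte (the low nybble is zero).
print dotted(pack("HH",   "D0", "B4")), "\n";   # 208.176  (0xD0, 0xB0)
print dotted(pack("HH",   "04", "34")), "\n";   # 0.48     (0x00, 0x30)

# "H2" consumes two hex digits per argument, giving the intended bytes.
print dotted(pack("H2H2", "D0", "B4")), "\n";   # 208.180  (0xD0, 0xB4)
print dotted(pack("H2H2", "04", "34")), "\n";   # 4.52     (0x04, 0x34)
```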

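STDOUT expects decoded text, but you are providing encoded text

First, let's look at the strings you are producing (once the problem above is fixed). All you need is the %vX format, which gives the value of each character in hex, separated by periods.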

  • "д" produces a one-character string. That character is the Unicode code point of д.
    $ perl -e'use utf8; printf("%vX\n", "д");'
    434
    
  • pack("H*", "D0B4") produces a two-character string. Those characters are the UTF-8 encoding of д.
    $ perl -e'printf("%vX\n", pack("H*", "D0B4"));'
    D0.B4
    
  • pack("H*", "0434") produces a two-character string. Those characters are the UCS-2be and UTF-16be encoding of д.
    $ perl -e'printf("%vX\n", pack("H*", "0434"));'
    4.34
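
The difference between the three strings also shows up in length and ord, which is why the question's showme reported ordinal x00D0 for the packed UTF-8 string. This is a small sketch; the helper name summary is an assumption for illustration.

```perl
use strict;
use warnings;
use utf8;   # the source contains a literal "д"

# Hypothetical helper: report how many characters a string has and the
# value of its first character.
sub summary {
    my ($s) = @_;
    return sprintf("length=%d first=U+%04X", length($s), ord($s));
}

print "decoded:  ", summary("д"), "\n";                  # length=1 first=U+0434
print "UTF-8:    ", summary(pack("H*", "D0B4")), "\n";   # length=2 first=U+00D0
print "UCS-2be:  ", summary(pack("H*", "0434")), "\n";   # length=2 first=U+0004
```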
    

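Normally, a file handle expects a string of bytes (characters with values in 0..255) to be printed to it. These bytes are output verbatim.[1][2]

When an encoding layer (e.g. :encoding(UTF-8)) is added to a file handle, it expects a string of Unicode code points (aka decoded text) to be printed to it instead.

Your program adds an encoding layer to STDOUT (through its use of the use open pragma), so you must provide UCP (decoded text) to print and say. You can obtain decoded text from encoded text using, for example, Encode's decode function.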

use utf8;
use open qw( :std :encoding(UTF-8) );
use feature qw( say );

use Encode qw( decode );

say "д";                   # ok  (UCP of "д")
say pack("H*", "D0B4");    # XXX (UTF-8 encoding of "д")
say pack("H*", "0434");    # XXX (UCS-2be and UTF-16be encoding of "д")

say decode("UTF-8",    pack("H*", "D0B4"));   # ok (UCP of "д")
say decode("UCS-2be",  pack("H*", "0434"));   # ok (UCP of "д")
say decode("UTF-16be", pack("H*", "0434"));   # ok (UCP of "д")
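
To see what the encoding layer does with each kind of string, print through the same layer into an in-memory file and inspect the resulting bytes. This is a sketch only; bytes_written is a hypothetical helper, not part of any module.

```perl
use strict;
use warnings;
use utf8;
use Encode qw( decode );

# Hypothetical helper: print $text through a UTF-8 encoding layer into an
# in-memory file and return the bytes that came out, as a hex string.
sub bytes_written {
    my ($text) = @_;
    my $buf = '';
    open(my $fh, '>:encoding(UTF-8)', \$buf) or die "open: $!";
    print {$fh} $text;
    close($fh) or die "close: $!";
    return unpack('H*', $buf);
}

# Decoded text: the layer encodes U+0434 once.
print bytes_written("д"), "\n";                                    # d0b4

# Already-encoded text: the layer treats D0 and B4 as the code points
# U+00D0 and U+00B4 and encodes each of them, i.e. double encoding.
print bytes_written(pack("H*", "D0B4")), "\n";                     # c390c2b4

# Decoding first gives the layer what it expects.
print bytes_written(decode("UTF-8", pack("H*", "D0B4"))), "\n";    # d0b4
```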

For the UTF-8 case, I need to set the UTF-8 flag on

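No, you need to decode the string.

The UTF-8 flag is irrelevant. Whether the flag is set initially is irrelevant. Whether the flag is set after the string is decoded is irrelevant. The flag indicates how the string is stored internally, which is something you should not care about.

For example, take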

use strict;
use warnings;
use open qw( :std :encoding(UTF-8) );
use feature qw( say );

my $x = chr(0xE9);

utf8::downgrade($x);   # Tell Perl to use the UTF8=0 storage format.
say sprintf "%s %vX %s", utf8::is_utf8($x) ? "UTF8=1" : "UTF8=0", $x, $x;

utf8::upgrade($x);   # Tell Perl to use the UTF8=1 storage format.
say sprintf "%s %vX %s", utf8::is_utf8($x) ? "UTF8=1" : "UTF8=0", $x, $x;

Output

UTF8=0 E9 é
UTF8=1 E9 é

Regardless of the UTF8 flag, what gets output is the UTF-8 encoding (C3 A9) of the UCP provided (U+00E9).
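
The same point can be checked without looking at terminal output: the two storage formats hold the same string as far as Perl code is concerned. A minimal sketch, with variable names chosen only for illustration:

```perl
use strict;
use warnings;
use Encode qw( encode );

my $down = chr(0xE9);
utf8::downgrade($down);   # stored with UTF8=0

my $up = chr(0xE9);
utf8::upgrade($up);       # stored with UTF8=1

# Only the internal flag differs; the string value, its length and its
# UTF-8 encoding are identical.
printf "UTF8 flag:   %d vs %d\n",
    utf8::is_utf8($down) ? 1 : 0, utf8::is_utf8($up) ? 1 : 0;
printf "same string: %s\n", $down eq $up ? "yes" : "no";
printf "same length: %s\n", length($down) == length($up) ? "yes" : "no";
printf "same UTF-8:  %s\n",
    encode("UTF-8", $down) eq encode("UTF-8", $up) ? "yes" : "no";
```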


I suppose it's because there is no way for Perl UCS-2 from ISO-8859-1, so that test is probably bollocks, right?

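At best, one could use heuristics to guess whether a string is encoded using iso-latin-1 or UCS-2be. I suspect one could get fairly accurate results (like those you'd get for iso-latin-1 versus UTF-8).

I'm not sure why you mention iso-latin-1, since nothing else in your question relates to iso-latin-1.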
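
For the iso-latin-1 versus UTF-8 case, one common heuristic is to attempt a strict UTF-8 decode and fall back to iso-latin-1 when that fails. The sketch below assumes that approach; guess_decode is a hypothetical name, and the result is only a guess, since short or pure-ASCII inputs are valid in both encodings.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: returns (guessed encoding name, decoded text).
sub guess_decode {
    my ($bytes) = @_;

    # FB_CROAK makes decode die on malformed UTF-8; LEAVE_SRC keeps
    # decode from modifying $bytes while it works.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return ('UTF-8', $text) if defined $text;

    # Every byte sequence is valid iso-latin-1, so this cannot fail.
    return ('ISO-8859-1', decode('ISO-8859-1', $bytes));
}

for my $hex (qw( D0B4 E9 )) {
    my ($encoding, $text) = guess_decode(pack('H*', $hex));
    printf "%-4s looks like %-10s -> %vX\n", $hex, $encoding, $text;
}
```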


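  1. Except on Windows, where a :crlf layer is added to handles by default.

  2. If you provide a string that contains a character that isn't a byte, you get a Wide character warning, and the utf8 encoding of the string is output instead.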
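
The behavior in footnote 2 can be observed without touching STDOUT by printing to an in-memory handle that has no encoding layer and collecting warnings. This is a sketch; the variable names are illustrative.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# A plain byte-oriented handle: no :encoding layer.
my $buf = '';
open(my $fh, '>', \$buf) or die "open: $!";
print {$fh} chr(0x434);    # U+0434 does not fit in a byte
close($fh) or die "close: $!";

print "warnings: ", scalar(@warnings), "\n";
print "first:    $warnings[0]" if @warnings;    # "Wide character in print at ..."
print "bytes:    ", unpack('H*', $buf), "\n";   # the utf8 encoding of the string
```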

See whether the following demo code helps.

use strict;
use warnings;
use feature 'say';

use utf8;     # https://perldoc.perl.org/utf8.html
use Encode;   # https://perldoc.perl.org/Encode.html

my $str;

my $utf8   = 'Привет Москва';
my $ucs2le = '1f044004380432043504420420001c043e0441043a0432043004';    # Little Endian
my $ucs2be = '041f044004380432043504420020041c043e0441043a04320430';    # Big Endian
my $utf16  = '041f044004380432043504420020041c043e0441043a04320430';
my $utf32  = '0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430';

# https://perldoc.perl.org/functions/binmode.html

binmode STDOUT, ':utf8'; 

# https://perldoc.perl.org/feature.html#The-'say'-feature

say 'UTF-8:   ' . $utf8;  

# https://perldoc.perl.org/Encode.html#THE-PERL-ENCODING-API

$str = pack('H*',$ucs2be);
say 'UCS-2BE: ' . decode('UCS-2BE',$str);  

$str = pack('H*',$ucs2le);
say 'UCS-2LE: ' . decode('UCS-2LE',$str);

$str = pack('H*',$utf16);
say 'UTF-16:  '. decode('UTF16',$str);

$str = pack('H*',$utf32);
say 'UTF-32:  ' . decode('UTF32',$str);

Output

UTF-8:   Привет Москва
UCS-2BE: Привет Москва
UCS-2LE: Привет Москва
UTF-16:  Привет Москва
UTF-32:  Привет Москва

Supported Cyrillic encodings

use strict;
use warnings;
use feature 'say';

use Encode;
use utf8;

binmode STDOUT, ':utf8';

my $utf8 = 'Привет Москва';
my @encodings = qw/UCS-2 UCS-2LE UCS-2BE UTF-16 UTF-32 ISO-8859-5 CP855 CP1251 KOI8-F KOI8-R KOI8-U/;

say '
:: Supported Cyrillic encoding
---------------------------------------------
UTF-8       ', $utf8;

for (@encodings) {
    printf "%-11s %s\n", $_, unpack('H*', encode($_,$utf8));
}

Output

:: Supported Cyrillic encoding
---------------------------------------------
UTF-8       Привет Москва
UCS-2       041f044004380432043504420020041c043e0441043a04320430
UCS-2LE     1f044004380432043504420420001c043e0441043a0432043004
UCS-2BE     041f044004380432043504420020041c043e0441043a04320430
UTF-16      feff041f044004380432043504420020041c043e0441043a04320430
UTF-32      0000feff0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430
ISO-8859-5  bfe0d8d2d5e220bcdee1dad2d0
CP855       dde1b7eba8e520d3d6e3c6eba0
CP1251      cff0e8e2e5f220cceef1eae2e0
KOI8-F      f0d2c9d7c5d420edcfd3cbd7c1
KOI8-R      f0d2c9d7c5d420edcfd3cbd7c1
KOI8-U      f0d2c9d7c5d420edcfd3cbd7c1

Documentation: Encode::Supported
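
A quick way to check that an encoding really covers a piece of text is to encode it and decode it again. The sketch below uses a made-up helper, round_trips, on a subset of the encodings listed above, with ISO-8859-1 added as a contrast because it has no Cyrillic letters.

```perl
use strict;
use warnings;
use utf8;
use Encode qw( encode decode );

# Hypothetical helper: encode $text with $encoding, decode the bytes again,
# and report whether the original text came back unchanged.
sub round_trips {
    my ($encoding, $text) = @_;
    my $bytes = encode($encoding, $text);
    return decode($encoding, $bytes) eq $text;
}

my $text = 'Привет Москва';

# With the default settings, encode substitutes a replacement character for
# anything the target encoding cannot represent, so ISO-8859-1 is reported
# as lossy here while the others return the original text.
for my $encoding (qw( UTF-8 UTF-16 UTF-32 UCS-2BE ISO-8859-5 CP1251 KOI8-R ISO-8859-1 )) {
    printf "%-11s %s\n", $encoding, round_trips($encoding, $text) ? "ok" : "lossy";
}
```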

Both are good answers. Here is a slight extension of Polar Bear's code that prints details about each string:

use strict;
use warnings;
use feature 'say';

use utf8;
use Encode;

sub about {
   my($str) = @_;
   # https://perldoc.perl.org/bytes.html
   my $charlen = length($str);
   my $txt;
   {
      use bytes;
      my $mark = (utf8::is_utf8($str) ? "yes" : "no");
      my $bytelen = length($str);
      $txt  = sprintf("Length: %d byte, %d chars, utf-8: %s, contents: %vd\n", 
                      $bytelen,$charlen,$mark,$str);
   }
   return $txt;
}

my $str;

my $utf8   = 'Привет Москва';
my $ucs2le = '1f044004380432043504420420001c043e0441043a0432043004';    # Little Endian
my $ucs2be = '041f044004380432043504420020041c043e0441043a04320430';    # Big Endian
my $utf16  = '041f044004380432043504420020041c043e0441043a04320430';
my $utf32  = '0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430';

binmode STDOUT, ':utf8';

say 'UTF-8:   ' . $utf8;
say about($utf8);

{
   my $str = pack('H*',$ucs2be);
   say 'UCS-2BE: ' . decode('UCS-2BE',$str);
   say about($str);
}

{
   my $str = pack('H*',$ucs2le);
   say 'UCS-2LE: ' . decode('UCS-2LE',$str);
   say about($str);
}

{
   my $str = pack('H*',$utf16);
   say 'UTF-16:  '. decode('UTF16',$str);
   say about($str);
}

{
   my $str = pack('H*',$utf32);
   say  'UTF-32:  ' . decode('UTF32',$str);
   say about($str);
}

# Try identity transcoding

{
   my $str_encoded_in_utf16 = encode('UTF16',$utf8);
   my $str = decode('UTF16',$str_encoded_in_utf16);
   say 'The same: ' . $str;
   say about($str);
}

Running this gives:

UTF-8:   Привет Москва
Length: 25 byte, 13 chars, utf-8: yes, contents: 208.159.209.128.208.184.208.178.208.181.209.130.32.208.156.208.190.209.129.208.186.208.178.208.176

UCS-2BE: Привет Москва
Length: 26 byte, 26 chars, utf-8: no, contents: 4.31.4.64.4.56.4.50.4.53.4.66.0.32.4.28.4.62.4.65.4.58.4.50.4.48

UCS-2LE: Привет Москва
Length: 26 byte, 26 chars, utf-8: no, contents: 31.4.64.4.56.4.50.4.53.4.66.4.32.0.28.4.62.4.65.4.58.4.50.4.48.4

UTF-16:  Привет Москва
Length: 26 byte, 26 chars, utf-8: no, contents: 4.31.4.64.4.56.4.50.4.53.4.66.0.32.4.28.4.62.4.65.4.58.4.50.4.48

UTF-32:  Привет Москва
Length: 52 byte, 52 chars, utf-8: no, contents: 0.0.4.31.0.0.4.64.0.0.4.56.0.0.4.50.0.0.4.53.0.0.4.66.0.0.0.32.0.0.4.28.0.0.4.62.0.0.4.65.0.0.4.58.0.0.4.50.0.0.4.48

The same: Привет Москва
Length: 25 byte, 13 chars, utf-8: yes, contents: 208.159.209.128.208.184.208.178.208.181.209.130.32.208.156.208.190.209.129.208.186.208.178.208.176
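
The byte counts above come from use bytes, which exposes Perl's internal storage; the bytes documentation discourages it for anything other than debugging. As an alternative, here is a sketch that asks how many bytes the string would take once encoded as UTF-8. The helper describe is an assumption for illustration. It can differ from about() for strings that hold characters in the range 128..255 and are stored with UTF8=0.

```perl
use strict;
use warnings;
use utf8;
use Encode qw( encode );

# Hypothetical helper: character count plus the size of the string when
# explicitly encoded as UTF-8 (independent of internal storage).
sub describe {
    my ($str) = @_;
    return sprintf("%d chars, %d bytes as UTF-8",
                   length($str), length(encode('UTF-8', $str)));
}

print describe('Привет Москва'), "\n";        # 13 chars, 25 bytes as UTF-8
print describe(pack('H*', '0434')), "\n";     # 2 chars, 2 bytes as UTF-8
```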

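I made a small diagram as an overview for next time, covering encode, decode and pack, because it's best to be prepared for the next time this comes up.

(The diagram above and its graphml file are available here)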