Aligning UTF-8 strings with the `printf` and `sprintf` functions in Perl

I have a Perl script whose job is to align UTF-8 encoded strings and write them to a file. Part of the script:

#!/usr/bin/perl
use strict;
use utf8;
use locale;
use warnings;
...
my $length_sv = 9;
open my $out, '>>:encoding(UTF-8)', "filename" or warn "Could not open file - $!" and exit(1);
my ($tid, $cid, $v3, $l, $v5, $sub) = $_ =~ /^\{"id":(\d+),"customer_id":(\d+)(.*?)_login":"(\w{1,10})"(.*?)"subject":"(.*?)"/;
my $subc = substr($sub, 0, $length_sv);
say $subc;
my $string = sprintf "| %-5s | %-1s | %-9s | %-${length_sv}s | %-11s | %-10s|","$time","$num","$tid","$subc","$cid","$l";
say $string;
say $out $string;
close $out;

After running the script, this is what appears on STDOUT:

Тест Mark
| 11:00 | 1 | 1234567   | Тест Mark | 10101012      | login   |

But the same line is written to the file incorrectly:

$ cat filename
| 11:00 | 1 | 1234567   | ТеÑÑ Mark | 10101012      | login   |

I want the column to contain Тест Mark, but the file ends up with ТеÑÑ Mark instead.

I tried adding this line to the script:

binmode($out,':utf8');

Unfortunately, it did not help. How can I solve this?

My suggestion for tracking down your error:

#!/usr/bin/env perl

use strict;
use warnings;
use utf8;
use v5.10;

open my $fh0, '>:encoding(UTF-8)', './russian_text' or die $!;

print $fh0 'Тест';

close( $fh0 );

say __LINE__, ': ', `cat ./russian_text`;

foreach my $mode ( '<', '<:encoding(UTF-8)' ) {
    open my $fh1, $mode, './russian_text' or die $!;
    my $line = <$fh1>;
    chomp $line;

    open my $fh2, '>:encoding(UTF-8)', './russian_out'  or die $!;
    print $fh2 "Mode: $mode, line: ", $line;
    close( $fh2 );
    say `cat ./russian_out`;
}

From perlopentut:

But never use the bare "<" without having set up a default encoding first. Otherwise, Perl cannot know which of the many, many, many possible flavors of text file you have, and Perl will have no idea how to correctly map the data in your file into actual characters it can work with. Other common encoding formats including "ASCII", "ISO-8859-1", "ISO-8859-15", "Windows-1252", "MacRoman", and even "UTF-16LE". See perlunitut for more about encodings.

If your data comes from somewhere else, you still have to decode it properly. Perl cannot work out the encoding automatically:

#!/usr/bin/env perl

use strict;
use warnings;
use Encode qw(decode);

# one-liner for a test server: perl -Mojo -E 'a(sub ($c) { $c->render(text => "Тест") })->start' daemon

my $web_data = `curl localhost:3000`;

my $decoded = decode('UTF-8', $web_data );

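open my $fh0, '>:encoding(UTF-8)', 'foo' or die $!;
open my $fh1, '>:encoding(UTF-8)', 'bar' or die $!;

# Undecoded bytes are encoded a second time by the layer: 'foo' is double-encoded.
print $fh0 $web_data;
# Decoded characters are encoded exactly once: 'bar' is correct UTF-8.
print $fh1 $decoded;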

Read about encode/decode here: perlunitut

Decode your inputs and encode your outputs.

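Bug #1: $sub, and therefore $subc, contains text that is still encoded as UTF-8, but printing to a file handle with an encoding layer expects decoded text. As a result, you end up with "double-encoded" text in the file. You need to decode your input.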
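To make the double encoding concrete, here is a minimal sketch using the core Encode module. The byte string is simply "Тест" written out as UTF-8 bytes, standing in for undecoded input; calling encode on it directly mimics what an :encoding(UTF-8) output layer does to text that was never decoded.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "Тест" as raw UTF-8 bytes, the way it arrives from an undecoded source.
my $bytes = "\xD0\xA2\xD0\xB5\xD1\x81\xD1\x82";

# Encoding undecoded bytes treats every byte as a Latin-1 character,
# so each of the 8 bytes becomes 2 bytes: the text is double-encoded.
my $double = encode('UTF-8', $bytes);

# Decoding first yields 4 characters; encoding those restores the original bytes.
my $chars  = decode('UTF-8', $bytes);
my $single = encode('UTF-8', $chars);

printf "bytes in: %d, double-encoded: %d, decode then encode: %d\n",
    length($bytes), length($double), length($single);
```

The double-encoded form is what shows up as ТеÑÑ when the file is viewed as UTF-8.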

Bug #2: Fixing the first bug will reveal another. You added an encoding layer to your file handle, but not to STDOUT. To fix this, add an encoding layer to STDOUT as well.

Fixed version:

# Adds an encoding/decoding layer to STDIN, STDOUT and STDERR.
# Sets the default :encoding for handles opened in scope (incl via ARGV).
use open ':std', ':encoding(UTF-8)';

use JSON qw( from_json );

# Same as `open(my $fh, '>:encoding(UTF-8)', $qfn)` because of `use open`
open(my $fh, '>', $qfn)
   or die("Can't open \"$qfn\": $!\n");

while ( my $json = <> ) {
   my $data = from_json($json);

   my $tid = $data->{id};
   my $cid = $data->{customer_id};
   my ($l) = map $data->{$_}, grep /_login\z/, keys(%$data);
   my $sub = $data->{subject};

   my $subc = substr($sub, 0, $length_sv);
   say $subc;

   my $string = sprintf "| %-5s | %-1s | %-9s | %-${length_sv}s | %-11s | %-10s|",
      $time, $num, $tid, $subc, $cid, $l;
   say $string;
   say $fh $string;
}

I also replaced your hand-written parser with a proper JSON parser.
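A related point when the JSON arrives as raw bytes (for example from backticks) rather than through a decoding layer: from_json expects already-decoded text, whereas decode_json expects UTF-8 bytes and does the decoding itself. The sketch below swaps in the core JSON::PP module so it runs without CPAN; the record and the key name user_login are made up for illustration.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical record as raw UTF-8 bytes; the subject is "Тест Mark".
my $json_bytes = '{"id":1234567,"customer_id":10101012,"user_login":"login",'
               . '"subject":"' . "\xD0\xA2\xD0\xB5\xD1\x81\xD1\x82 Mark" . '"}';

# decode_json parses the JSON and decodes the UTF-8 in one step.
my $data    = decode_json($json_bytes);
my $subject = $data->{subject};

# substr and length now count characters, so no letter is cut in half.
my $short = substr($subject, 0, 4);
printf "subject has %d characters\n", length($subject);
```

Because the strings in $data are decoded, they can be passed straight to sprintf and printed to a handle that has an encoding layer.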

Here is my script:

#!/usr/bin/perl
use strict;
use utf8;
#use locale;
use warnings;

my $curl = qq( curl -s "https://example.com/all" -H "authorization: Bearer $token" \
    --data-raw '{"scope":[{"by_subject":{"login":['"$id"']}}],"page":0,"page_size":10}' \
    --compressed 2>/dev/null );
my $response = `$curl`;

$response =~ s/\},\{/\}###\{/g;
my @array = split(/###/, $response);

my $length_sv = 9;
#open my $out, '>>:encoding(UTF-8)', "filename" or warn "Could not open file - $!" and exit(1);
open my $out, '>>', "filename" or warn "Could not open file - $!" and exit(1);

foreach(@array) {
    my ($tid, $cid, $v3, $l, $v5, $sub) = $_ =~ /^\{"id":(\d+),"customer_id":(\d+)(.*?)_login":"(\w{1,10})"(.*?)"subject":"(.*?)"/;
    my $subc = substr($sub, 0, $length_sv);
    say $subc;
    my $string = sprintf "| %-5s | %-1s | %-9s | %-${length_sv}s | %-11s | %-10s|","$time","$num","$tid","$subc","$cid","$l";
    say $string;
    say $out $string;
}

close $out;

I removed the line use locale and replaced open my $out, '>>:encoding(UTF-8)' with open my $out, '>>'. After that, the file is written with the correct encoding.

  1. You should set the encoding for the input where "Тест" comes from.

Bug #1: $sub and thus $subc contains text encoded using UTF-8, but printing to a file handle with an encoding layer expects decoded text. The consequence is that you end up with "double-encoded" text in the file. You need to decode your input.

Can you tell me how to correctly specify the encoding of incoming data that comes from an external program?

Bug #2: Fixing the first bug will reveal another. You added an encoding layer to your file handle, but not to STDOUT. To fix this, add an encoding layer to STDOUT as well.

I think the right way to do that is to add:

binmode(STDOUT,':utf8');

What remains is the question of how to specify the encoding of the input data.

This is how I get the input:

my $curl = qq( curl -s "https://example.com/all" -H "authorization: Bearer $token" \
    --data-raw '{"scope":[{"by_subject":{"login":['"$id"']}}],"page":0,"page_size":10}' \
    --compressed 2>/dev/null );
my $response = `$curl`;
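One way to handle output captured from an external program is to decode it explicitly with Encode::decode before doing anything else with it. The sketch below is only an illustration: a child perl process stands in for the curl call and prints "Тест Mark" as UTF-8 bytes, and the column width of 15 is arbitrary. It also shows why decoding matters for alignment, since sprintf pads by characters on decoded text but by bytes on raw input.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for the curl call: a child perl that prints "Тест Mark" as UTF-8 bytes.
open(my $pipe, '-|', $^X, '-e', 'print "\xD0\xA2\xD0\xB5\xD1\x81\xD1\x82 Mark"')
    or die "Can't start child process: $!\n";
my $raw = do { local $/; <$pipe> };
close($pipe);

# Decode the captured bytes into characters.
my $text = decode('UTF-8', $raw);

# Raw input is 13 bytes and gets only 2 padding spaces;
# decoded text is 9 characters and gets the intended 6.
my $padded_raw  = sprintf '%-15s|', $raw;
my $padded_text = sprintf '%-15s|', $text;

# Encode on output so the decoded text is written as UTF-8 exactly once.
binmode(STDOUT, ':encoding(UTF-8)');
print "$padded_text\n";
printf "raw length: %d, decoded length: %d\n", length($raw), length($text);
```

For the script above, the analogous step would be decode('UTF-8', $response) right after the backticks, assuming the service really returns UTF-8.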

Thank you for your reply! You have been a great help!