Converting full-width characters to half-width characters
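I have a program that converts full-width characters to half-width characters. It works fine except for the digit zero: a full-width zero is not converted to a half-width zero.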
Perl
use strict;
use warnings;
use warnings qw(FATAL utf8);
use utf8;
use feature qw(unicode_strings);
use open qw(:std :utf8);
unless ( @ARGV == 2 ) {
print "Usage: script.pl input_file output_file\n";
exit;
}
# Keys are full-width characters, values are their half-width equivalents.
# "\x{3000}" is the ideographic (full-width) space.
my %fwhw = (
'０' => '0', '１' => '1', '２' => '2', '３' => '3', '４' => '4',
'５' => '5', '６' => '6', '７' => '7', '８' => '8', '９' => '9',
'Ａ' => 'A', 'Ｂ' => 'B', 'Ｃ' => 'C', 'Ｄ' => 'D', 'Ｅ' => 'E',
'Ｆ' => 'F', 'Ｇ' => 'G', 'Ｈ' => 'H', 'Ｉ' => 'I', 'Ｊ' => 'J',
'Ｋ' => 'K', 'Ｌ' => 'L', 'Ｍ' => 'M', 'Ｎ' => 'N', 'Ｏ' => 'O',
'Ｐ' => 'P', 'Ｑ' => 'Q', 'Ｒ' => 'R', 'Ｓ' => 'S', 'Ｔ' => 'T',
'Ｕ' => 'U', 'Ｖ' => 'V', 'Ｗ' => 'W', 'Ｘ' => 'X', 'Ｙ' => 'Y',
'Ｚ' => 'Z', 'ａ' => 'a', 'ｂ' => 'b', 'ｃ' => 'c', 'ｄ' => 'd',
'ｅ' => 'e', 'ｆ' => 'f', 'ｇ' => 'g', 'ｈ' => 'h', 'ｉ' => 'i',
'ｊ' => 'j', 'ｋ' => 'k', 'ｌ' => 'l', 'ｍ' => 'm', 'ｎ' => 'n',
'ｏ' => 'o', 'ｐ' => 'p', 'ｑ' => 'q', 'ｒ' => 'r', 'ｓ' => 's',
'ｔ' => 't', 'ｕ' => 'u', 'ｖ' => 'v', 'ｗ' => 'w', 'ｘ' => 'x',
'ｙ' => 'y', 'ｚ' => 'z', '－' => '-', '、' => ', ', "\x{3000}" => ' ',
'／' => '/',);
sub slurp {
my $file = shift;
open my $fh_read, '<', $file or die "Could not open file: $!";
return do {local $/; <$fh_read>};
}
sub convert {
my $sub_string = shift;
$sub_string =~ s/(.)/$fwhw{$1}?$fwhw{$1}:$1/seg;
return $sub_string;
}
my $string = slurp($ARGV[0]);
$string =~ s/<target>\s*<g id="\d+">\K(.*?)(?=<\/g>\s*<\/target>)/convert($1)/seg;
open my $fh_write, ">", $ARGV[1] or die "Could not open file: $!";
print $fh_write $string;
close $fh_write;
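Here is what I have tried:
By checking their code points, I confirmed that the digit ０ (zero) and the letter Ｏ (oh) really are different characters. The full-width ０ is \x{ff10} and the full-width letter Ｏ is \x{ff2f}. I checked this with the following code: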
use Encode;
sub codepoint_hex {
sprintf "%04x", ord Encode::decode("UTF-8", shift);
}
my $codepoint = codepoint_hex('０');
print $codepoint, "\n";
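I have verified that the hash really is loaded with all of its keys and values.
What I have not tried:
- I have not tried to reproduce this oddity on Linux. I am using ActiveState Perl 5.24 on Windows 10.
If anyone has suggestions or sees my mistake, I would greatly appreciate the guidance.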
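Because $fwhw{'０'} returns 0, and 0 is false, the conditional falls through to $1 and the full-width zero is left unchanged. Replace
$sub_string =~ s/(.)/$fwhw{$1}?$fwhw{$1}:$1/seg;
with
$sub_string =~ s/(.)/exists($fwhw{$1})?$fwhw{$1}:$1/seg;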
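The difference between the two tests is easy to see in isolation. The following is a minimal sketch, assuming a hypothetical two-entry table and two illustrative sub names (convert_truthy, convert_exists) that are not part of the original script:

```perl
use strict;
use warnings;
use utf8;

# Hypothetical two-entry table, just enough to show the problem.
my %fwhw = ( '０' => '0', '１' => '1' );

# Truthiness test: the value '0' is false, so the full-width zero
# falls through to $1 and is left unchanged.
sub convert_truthy {
    my $s = shift;
    $s =~ s/(.)/$fwhw{$1} ? $fwhw{$1} : $1/seg;
    return $s;
}

# exists() asks whether the key is present, whatever its value is.
sub convert_exists {
    my $s = shift;
    $s =~ s/(.)/exists($fwhw{$1}) ? $fwhw{$1} : $1/seg;
    return $s;
}

binmode STDOUT, ':encoding(UTF-8)';
print convert_truthy('１０'), "\n";    # prints 1０ (zero still full-width)
print convert_exists('１０'), "\n";    # prints 10
```

Any value that Perl treats as false (0, '0', the empty string) would be skipped by the first form, so exists is the safer test for a lookup table.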
If that still doesn't work, use sprintf "%vX", $str to see what the string really contains.
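As a rough illustration of that check, the sketch below prints the code points of two hypothetical sample strings. It also shows what an undecoded (raw UTF-8) string would look like, since that is a common reason for a lookup to miss; the use of Encode here is only for the demonstration:

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

# Hypothetical sample strings: a full-width zero and a full-width letter O.
my $zero = '０';
my $oh   = 'Ｏ';

# %vX prints the ordinal of each character in hex, joined by dots.
print sprintf("%vX", $zero), "\n";         # FF10
print sprintf("%vX", $oh), "\n";           # FF2F
print sprintf("%vX", "$zero$oh"), "\n";    # FF10.FF2F

# If the string were still raw UTF-8 bytes, each byte would show up separately.
print sprintf("%vX", encode('UTF-8', $zero)), "\n";    # EF.BC.90
```

Seeing EF.BC.90 instead of FF10 would suggest that the input was never decoded, in which case no single-character hash key can match it.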
By the way,
sub convert {
my $sub_string = shift;
$sub_string =~ s/(.)/exists($fwhw{$1})?$fwhw{$1}:$1/seg;
return $sub_string;
}
would be a lot faster if replaced with
# Needs "use feature qw(state);" and Perl 5.14+ for the /r modifier.
sub convert {
state $chars = join '', keys(%fwhw);
state $re = qr/([\Q$chars\E])/;
return $_[0] =~ s/$re/$fwhw{$1}/gr;
}
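Even faster,
# tr maps one character to one character, so every value in %fwhw must be a single character.
sub convert {
state $s = join '', keys(%fwhw);
state $r = join '', values(%fwhw);
# \$_[0] is escaped so that it is not interpolated before eval compiles the sub.
state $tr = eval("sub { \$_[0] =~ tr/\Q$s\E/\Q$r\E/r }");
return $tr->($_[0]);
}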
You don't need such a big dictionary and so many supporting functions. A simple sed is enough:
halfwidth='!"#$%&'\''()*+,-.\/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~⦅⦆¢£¬¯¦¥₩ '
fullwidth='！＂＃＄％＆＇（）＊＋，－．／０１２３４５６７８９：；＜＝＞？＠ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ［＼］＾＿｀ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ｛｜｝～｟｠￠￡￢￣￤￥￦　'
sed -i -e "y/$fullwidth/$halfwidth/" your_file
If you want to do this in Perl, it is just as simple:
perl -Mutf8 -i -C -pe 'BEGIN{ use open qw/:std :utf8/; } tr#！＂＃＄％＆＇（）＊＋，－．／０１２３４５６７８９：；＜＝＞？＠ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ［＼］＾＿｀ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ｛｜｝～｟｠￠￡￢￣￤￥￦　#!"\#$%&'\''()*+,-.\/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~⦅⦆¢£¬¯¦¥₩ #' your_file
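For the full-width ASCII block specifically, the long character lists can probably be avoided. The code points U+FF01 through U+FF5E line up one-to-one with U+0021 through U+007E, so tr ranges can express the mapping. The following is a sketch of that alternative, not something taken from either answer; to_halfwidth is a hypothetical helper name, and it deliberately ignores the extra symbols (such as ￠ and ￥) that the lists above cover:

```perl
use strict;
use warnings;
use utf8;

# Sketch: map the full-width ASCII block onto ASCII with tr ranges.
sub to_halfwidth {
    my $s = shift;
    # U+FF01..U+FF5E correspond one-to-one to U+0021..U+007E;
    # U+3000 (ideographic space) is mapped to a plain space by hand.
    $s =~ tr/\x{FF01}-\x{FF5E}\x{3000}/\x{21}-\x{7E} /;
    return $s;
}

binmode STDOUT, ':encoding(UTF-8)';
print to_halfwidth('ＡＢＣ－０１２／ｘｙｚ'), "\n";    # prints ABC-012/xyz
```

Because the ranges are written as \x{...} escapes, no character in the list needs escaping, and the digit zero is handled like any other character.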