Strip string but allow umlauts

Question

#!/usr/bin/perl -T
use strict;
use warnings;
use utf8;
my $s = shift || die;
$s =~ s/[^A-Za-z ]//g;
print "$s\n";
exit;

> ./poc.pl "El Guapö"
El Guap

Is there a way to modify this Perl code so that umlauts and accented characters are not stripped? Thanks!

Answer 1

For the direct question, you probably just need the \p{L} (Letter) Unicode Character Property.

More importantly, though: decode all input and encode all output.

use warnings;
use strict;
use feature 'say';

use utf8;   # allow non-ascii (UTF-8) characters in the source

use open ':std', ':encoding(UTF-8)';  # for standard streams

use Encode qw(decode_utf8);           # @ARGV escapes the above

my $string = 'El Guapö';
if (@ARGV) {
    $string = join ' ', map { decode_utf8($_) } @ARGV;
}
say "Input:     $string";

$string =~ s/[^\p{L} ]//g;

say "Processed: $string";

When run as script.pl 123 El Guapö=_ this prints

Input:     123 El Guapö=_
Processed:  El Guapö

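I used the "blanket" \p{L} property (Letter) because the question does not describe the requirements in detail; adjust if and as needed. Unicode properties offer a lot more; see the Unicode Character Property documentation mentioned above and the complete list in perluniprops.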

The space between 123 and El is kept, so the result starts with a space; you may want to strip leading (and trailing) spaces at the end.
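A minimal sketch of that final trimming step. The variable names are only illustrative, and the sample value stands in for what the substitution above leaves behind:

```perl
use strict;
use warnings;
use utf8;                                  # the literal below contains ö

binmode STDOUT, ':encoding(UTF-8)';        # encode output

# Stand-in for the result of s/[^\p{L} ]//g on " 123 El Guapö=_ "
my $processed = '  El Guapö ';

# Remove whitespace at the start and at the end, keep interior spaces
(my $trimmed = $processed) =~ s/^\s+|\s+$//g;

print "[$trimmed]\n";
```

Interior runs of spaces are left alone here; collapsing them would need a separate substitution such as s/\s+/ /g.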

Note that there is also \P{L}, where the capital P indicates negation.
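As a small illustration (variable names are made up for this sketch), the negated property does the same job as a negated character class when letters are the only thing to keep:

```perl
use strict;
use warnings;
use utf8;                                  # the literal below contains ö

my $text = 'El Guapö 123';

# Negated bracketed class: delete everything that is not a letter
(my $via_class = $text) =~ s/[^\p{L}]//g;

# Negated property: \P{L} matches any single non-letter
(my $via_property = $text) =~ s/\P{L}//g;

# Both strings are now the letters only, with spaces and digits removed
my $same = $via_class eq $via_property ? 1 : 0;
```

To keep spaces as well, the bracketed form [^\p{L} ] is still the convenient one, since \P{L} on its own also matches the space.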

The simple-minded \pL above does not work with Combining Diacritical Marks, as the mark will be removed as well. Thanks to jm666 for pointing this out.

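This happens when an accented "logical" character (an extended grapheme cluster, which is displayed as a single character) is written using separate characters for its base and for its non-spacing mark(s) (combining accents). Often a single precomposed character with its own code point exists as well.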

Example: in niño the ñ is U+00F1, but it can also be written as "n\x{303}".
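To make the two spellings concrete, here is a short sketch using the core module Unicode::Normalize. It is not part of the original answer and the variable names are illustrative; it only shows that the two strings differ as code point sequences but are equivalent after normalization:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "ni\N{U+00F1}o";   # ñ as one code point
my $decomposed  = "nin\x{303}o";     # n followed by COMBINING TILDE

# Different lengths in code points, so a plain eq comparison fails
my $len_pre  = length $precomposed;
my $len_dec  = length $decomposed;
my $same_raw = $precomposed eq $decomposed ? 1 : 0;

# After normalizing both to the composed form they compare equal
my $same_nfc = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;

# \X matches one extended grapheme cluster, so "n + tilde" counts once
my $clusters = () = $decomposed =~ /\X/g;
```

Normalizing input to NFC before filtering is one possible alternative to widening the character class, though it only helps for marks that have a precomposed form.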

To keep accents written this way, add \p{Mn} (\p{NonspacingMark}) to the character class:

my $string = "El Guapö=_ ni\N{U+00F1}o.* nin\x{303}o+^";
say $string;

(my $nodiac = $string) =~ s/[^\pL ]//g;      # naive, combining accents get removed
say $nodiac;

(my $full = $string) =~ s/[^\pL\p{Mn} ]//g;  # add non-spacing mark
say $full;

Output

El Guapö=_ niño.* niño+^
El Guapö niño nino
El Guapö niño niño

So you want s/[^\p{L}\p{Mn} ]//g in order to keep the combining accents.
