Multi search and replace with sed or similar commands in a Unicode environment

Question

I have some *.txt files in c:\apple and its subdirectories, on Windows 7. For example:

c:\apple\orange
c:\apple\pears ....etc

However, the number of subfolders under c:\apple is unknown.

I have a text file (say sample.txt) that works like a configuration file, with this structure:

綫 &#32171;
胆 &#32966;
湶 &#28278;
峯 &#23791;

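There is a single space between each Chinese character and its replacement string.

I would like to use sample.txt to search all text files in C:\APPLE\ and its subdirectories, find those Chinese characters, and replace each one with the string that follows it.

I tried sed, but had no success with Chinese characters.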
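
As an illustration of the format just described, here is a minimal Python sketch that reads such a file into a dictionary. The function name `load_mapping` and the default `utf-16` encoding are assumptions made for the example; the post does not say how sample.txt is actually encoded.

```python
def load_mapping(path, encoding="utf-16"):
    """Read 'character replacement' pairs, one per line, into a dict.

    The encoding default is an assumption; Python's "utf-16" codec
    consumes a leading byte-order mark if one is present.
    """
    mapping = {}
    with open(path, encoding=encoding) as fh:
        for line in fh:
            parts = line.split()      # split on whitespace; drops the line ending
            if len(parts) == 2:       # skip blank or malformed lines
                source, replacement = parts
                mapping[source] = replacement
    return mapping

# Hypothetical usage: mapping = load_mapping(r"c:\temp\sample.txt")
```

If the file turns out to be UTF-8, passing `encoding="utf-8-sig"` should handle an optional byte-order mark in the same way.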

sed -r "s/^(.*) (.*)/s@\1@\2@/g" c:\temp\sample.txt *.txt

Does anyone have an idea?

Answer 1

Assuming your text files, including sample.txt, are encoded in UTF-16LE, please try:

perl -e '
use utf8;
use File::Find;

$topdir = "c:/apple";               # top level of subfolders
$mapfile = "c:/temp/sample.txt";    # config file to map character to code
$enc = "utf16le";                   # character coding of texts

open(FH, "<:encoding($enc)", $mapfile) or die "$mapfile: $!";
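while (<FH>) {
    s/^\x{FEFF}//;                  # drop a leading byte-order mark, if any
    @_ = split(" ");
    next unless @_ == 2;            # skip blank or malformed lines
    $map{$_[0]} = $_[1];
}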
close(FH);

find(\&process, $topdir);

sub process {
    my $file = $_;
    if (-f $file && $file =~ /\.txt$/) {
        my $tmp = "$file.tmp";
        my $lines = "";
        open(FH, "<:encoding($enc)", $file) or die "$file: $!";
        open(W, ">:encoding($enc)", $tmp) or die "$tmp: $!";
        while (<FH>) {
            $lines .= $_;           # slurp all text
        }
        foreach $key (keys %map) {
            $lines =~ s/\Q$key\E/$map{$key}/g;   # match the key literally
        }
        print W $lines;
        close(FH);
        close(W);
        rename $file, "$file.bak";  # back-up original file
        rename $tmp, $file;
    }
}'

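I should mention that I have not tested the code in a Windows execution environment (it was tested on Linux with files from Windows). Please let me know if there are problems. You may need to modify the assignments to $topdir, $mapfile, or $enc.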
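
For comparison, the same walk-and-replace idea could be sketched in Python roughly as follows. This is not part of the original answer: `replace_in_tree` is a hypothetical name, the mapping is assumed to be already loaded into a dict, and the `utf-16` default is an assumption about the files. Unlike the loop over keys above, it builds one alternation pattern so each file is scanned once and inserted text is never matched again by a later key.

```python
import os
import re

def replace_in_tree(topdir, mapping, encoding="utf-16"):
    """Apply mapping to every .txt file under topdir; return the changed paths.

    Hypothetical helper; the encoding default is an assumption.
    """
    if not mapping:
        return []
    # Longest keys first, each escaped so it is matched literally.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    changed = []
    for dirpath, _dirnames, filenames in os.walk(topdir):
        for name in filenames:
            if not name.lower().endswith(".txt"):
                continue
            path = os.path.join(dirpath, name)
            # newline="" leaves CRLF line endings untouched
            with open(path, encoding=encoding, newline="") as fh:
                text = fh.read()
            new_text = pattern.sub(lambda m: mapping[m.group(0)], text)
            if new_text != text:
                os.replace(path, path + ".bak")   # keep the original as a backup
                with open(path, "w", encoding=encoding, newline="") as fh:
                    fh.write(new_text)
                changed.append(path)
    return changed

# Hypothetical usage: replace_in_tree("c:/apple", {"峯": "&#23791;"})
```

Files with no matches are left alone, so a `.bak` copy appears only next to files that were actually rewritten. The mapping file itself should live outside the tree being processed, as c:\temp\sample.txt does here.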
