Padding a file containing Russian Cyrillic characters does not work - one Russian character is counted as 2 bytes

Question

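I am trying to create a file with fixed-length columns on Unix. The file contains Russian Cyrillic characters, which are handled differently from ordinary 1-byte characters.

I am using the script below to transform the file (the column separator is @-@ and the row separator is \r\n):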

input_file=$1
output_file=$2

awk -F '@-@' '{printf("%-200s%-200s%-200s%-200s%-200s%-200s%-200s%-200s\r\n", $1, $2, $3, $4, $5, $6, $7, $8)}' "$input_file" > "$output_file"

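For columns that hold ordinary characters, the output file correctly contains 200-character columns. For a column that holds 30 Cyrillic characters, however, the output column is only 170 characters wide: the 30 characters occupy 60 bytes, so only 140 bytes of padding are added. As a result the rows do not all have the same length, because each Cyrillic character takes 2 bytes and the code counts bytes rather than characters.

Example: НИКОЛАЕВНА has 10 characters, but the script counts it as 20 because it occupies 20 bytes.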
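The mismatch can be reproduced outside awk. The sketch below assumes a POSIX shell with `wc` and that the script itself is saved as UTF-8; the variable names are only illustrative. It counts the same word once in bytes and once in characters.

```shell
s='НИКОЛАЕВНА'

# Bytes: each Cyrillic letter in this word takes 2 bytes in UTF-8.
bytes=$(( $(printf '%s' "$s" | wc -c) ))

# Characters: wc -m follows the current locale, so this is 10 in a
# UTF-8 locale but equals the byte count in the "C" locale.
chars=$(( $(printf '%s' "$s" | wc -m) ))

echo "bytes=$bytes chars=$chars"
```

If the two numbers are equal on your system, the shell is probably running in the "C" locale, and awk is likely to count bytes as well.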

A sample line from the input file:

НИКОЛАЕВНА@-@russ@-@12345@-@asklle@-@НИКОЛАЕВНА@-@454@-@111@-@asdfg

Could you please suggest a way to do the padding so that all rows have the same number of characters?

Thanks!

Answer 1

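I suggest using gawk's character-based string function substr to trim your strings. In a multibyte locale such as UTF-8, gawk's standard printf width formatting also counts characters rather than bytes. Check that you are using a recent gawk.

To trim all of your fields to 200 characters: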

for (i = 1; i <= NF; i++) $i = substr($i,1,200);

So your script should be:

awk -F '@-@' '{for(i=1;i<=NF;i++)$i=substr($i,1,200);printf("%-200s%-200s%-200s%-200s%-200s%-200s%-200s%-200s\r\n", $1, $2, $3, $4, $5, $6, $7, $8)}' "$input_file" > "$output_file"

Or, more concisely:

script.awk

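BEGIN { FS = "@-@" }            # fields are separated by '@-@'
{
    for (i = 1; i <= 8; i++) {
        # trim to 200 characters, then left-justify in a 200-character column
        printf("%-200s", substr($i, 1, 200));
    }
    printf("\r\n");             # end the row with CRLF; a bare print would append the whole record again
}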

Answer 2

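I don't believe awk in general can do this, but gawk should handle it by default as long as your locale is not set to "C". For example, running gawk with LC_ALL=en_US.UTF-8 should give the expected behavior.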
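Where neither gawk nor a UTF-8 locale is available, the padding can still be computed by hand. The following is a minimal sketch, not something taken from the answers: it assumes the input is valid UTF-8 and that awk counts bytes under `LC_ALL=C` (which gawk and mawk do). Every UTF-8 character contains exactly one byte that is not a continuation byte (0x80 to 0xBF), so the character count is the byte count minus the number of continuation bytes. The helper name `pad_field` is hypothetical.

```shell
# pad_field TEXT WIDTH: left-justify TEXT in WIDTH character columns.
# Hypothetical helper; assumes TEXT is valid UTF-8 on a single line.
pad_field() {
    printf '%s\n' "$1" | LC_ALL=C awk -v width="$2" '{
        bytes = length($0)                     # byte count, because of LC_ALL=C
        tmp = $0
        cont = gsub(/[\200-\277]/, "", tmp)    # number of UTF-8 continuation bytes
        chars = bytes - cont                   # one non-continuation byte per character
        out = $0
        for (i = chars; i < width; i++)        # append the missing spaces
            out = out " "
        printf "%s", out
    }'
}

padded=$(pad_field 'НИКОЛАЕВНА' 15)
printf '[%s]\n' "$padded"
```

The same counting could be applied per field inside the original script in place of the `%-200s` width, at the cost of building the padding manually.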

Answer 3

Try the following awk script:

script.awk

BEGIN {
    FS = "@-@";                 # field separator is '@-@'
    h = "          ";           # length(h) = 10
    h = h h h h h h h h h h;    # length(h) = 100
    h = h h;                    # length(h) = 200
}
{
    for (i = 1; i <= 8; i++) {
        # length() counts characters in gawk under a UTF-8 locale
        head = substr(h, 1, length(h) - length($i));  # cut the padding to the missing width
        printf("%s%s", head, $i);  # padding first, so the field is right-aligned;
                                   # use printf("%s%s", $i, head) to left-justify
    }
    printf("\r\n");             # end the row with CRLF; a bare print would append the whole record again
}
