Perl 6: How to get all lines that are not indented by x-width of spaces?

Question

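I am working with a very, very large text file that contains lines with various indentation sizes. The acceptable lines have an indentation 12 characters wide, created by a combination of tabs and space characters. Now I would like to get all the lines that do not have a 12-character-wide indentation; these lines have an indentation anywhere from 0 to 11 characters wide, made from a combination of tabs and space characters.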

if $badLine !~~ m/ ^^ [\s ** 12 ||
                      \t \s ** 4 ||
                      \s \t \s ** 3 ] / { say $badLine; }

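But the problem is that when you work on a text file in a word processor, pressing the Tab key can give you anywhere from 0 to 8 space-character widths to fill the gap. What is a smart way to get all the unacceptable lines that do not have a 12-character-wide indentation?

Thanks.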

Answer 1

Assuming I understood your question correctly (and did not mess anything up), one approach would be the following:

# some test input
my \INPUT = qq:to/EOI/;
           11s
            12s
             13s
\t    1t 4s
 \t   1s 1t 3s
    4s
   \t    3s 1t 4s
        \t8s 1t
EOI

# compute indentation width
sub indent-width($_) {
    my $n = 0;

    # iterate over characters
    for .comb {
        # tabs only take enough space to fill an octet
        when "\t" { $n += 8 - $n % 8 }
        default { ++$n }
    }
    $n;
}

# generate output, see below
say ?/^ :r (\h+) <?{ indent-width(~$0) == 12 }> /, " {.trim}"
    for INPUT.lines;

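The interesting part of the last bit of code is the regex

/^ :r (\h+) <?{ indent-width(~$0) == 12 }> /

which captures the horizontal whitespace at the start of the input, followed by an assertion <?{...}> that checks whether the width of the capture $0 is 12.

Note that we also supply the :r modifier so that the regex engine does not backtrack: otherwise, we would also match lines indented by more than 12 positions.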
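
To make the effect of :r concrete, here is a minimal sketch that runs the regex with and without the modifier. The test line (13 spaces of indentation) and the variable name $line are made up for illustration; indent-width is the sub from the answer.

```raku
# Same width computation as in the answer above.
sub indent-width($_) {
    my $n = 0;
    for .comb {
        when "\t" { $n += 8 - $n % 8 }
        default   { ++$n }
    }
    $n;
}

# Hypothetical test line: 13 spaces of indentation, one too many.
my $line = " " x 13 ~ "too deep";

# With :r, \h+ keeps all 13 spaces, so the width check fails.
say ?($line ~~ /^ :r (\h+) <?{ indent-width(~$0) == 12 }> /);   # False

# Without :r, the engine backtracks \h+ to 12 spaces and the check passes.
say ?($line ~~ /^ (\h+) <?{ indent-width(~$0) == 12 }> /);      # True
```

The second result is the false positive the modifier is there to prevent: a 13-wide indentation contains a 12-wide prefix.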

Answer 2

Width 12

For an indentation width of 12, assuming tab stops at positions 0, 8, 16, etc.:

for $input.lines {
    .say if not /
        ^                             # start of line
        [" " ** 8 || " " ** 0..7 \t]  # whitespace up to first tab stop
        [" " ** 4]                    # whitespace up to position 12
        [\S | $]                      # non-space character or end of line
    /;
}

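Explanation:

To get from the start of the line (position 0) to the first tab stop (position 8), there are two possibilities we need to match:
- 8 spaces.
- 0 to 7 spaces followed by 1 tab. (The tab jumps straight to the tab stop, so it fills whatever width remains after the spaces.)
The only way to get from the tab stop (position 8) to the indentation target (position 12) is 4 spaces. (A tab would jump past the target to the next tab stop at position 16.)
It is important to anchor to the start of the line and to whatever follows the indentation, so that we do not accidentally match part of a longer indentation.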
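
As a quick check of the cases just listed, the following sketch runs the same regex over a handful of made-up lines (the array @lines and its contents are assumptions for illustration, not taken from the question's file). Each line's text states its visual indentation.

```raku
# Hypothetical sample lines; the text states the visual indentation width.
my @lines =
    " " x 12 ~ "12 via spaces",
    "\t    12 via tab + 4 spaces",
    "   \t    12 via 3 spaces + tab + 4 spaces",
    " " x 11 ~ "11 via spaces",
    "\t\t16 via two tabs";

for @lines {
    # Print only the lines that are NOT indented by exactly 12 columns.
    .trim.say if not /
        ^
        [" " ** 8 || " " ** 0..7 \t]
        [" " ** 4]
        [\S | $]
    /;
}
# Expected: only the "11 via spaces" and "16 via two tabs" lines are printed.
```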

Arbitrary width

The indentation match can be factored out into a parameterized named token that handles arbitrary widths:

my token indent ($width) {
    [" " ** 8 || " " ** 0..7 \t] ** {$width div 8}
     " " ** {$width % 8}
}

.say if not /^ <indent(12)> [\S | $]/ for $input.lines;

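Explanation:

The same expression as above is used to reach the first tab stop, but it is now repeated as many times as needed to reach the last tab stop before the target ($width div 8 times, where div is the integer division operator).
Whatever distance remains between the last tab stop and the target has to be filled with spaces ($width % 8 spaces, where % is the modulo operator).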
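
For a width other than 12, the arithmetic can be checked with a small sketch. It reuses the token from the answer with a width of 20, so 20 div 8 gives 2 tab-stop groups and 20 % 8 gives 4 trailing spaces. The sample lines in @lines are made up for illustration.

```raku
my token indent ($width) {
    [" " ** 8 || " " ** 0..7 \t] ** {$width div 8}
     " " ** {$width % 8}
}

# Hypothetical sample lines; all but the last are indented to column 20.
my @lines =
    "\t\t    two tabs + 4 spaces",
    " " x 20 ~ "20 spaces",
    "\t  \t    tab + 2 spaces + tab + 4 spaces",
    "\t\t\ttoo deep, three tabs";

# 20 div 8 == 2 tab-stop groups, then 20 % 8 == 4 trailing spaces.
.trim.say if not /^ <indent(20)> [\S | $]/ for @lines;
# Expected: only the "too deep, three tabs" line is printed.
```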

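Arbitrary position and width

The token in the example above assumes that it starts matching at a tab-stop position (such as the start of a line). It can be generalized further to match a given width of tabs and spaces no matter where in the line you call it: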

my token indent ($width) {
    :my ($before-first-stop, $number-of-stops, $after-last-stop);
    {
        $before-first-stop = min $width, 8 - $/.from % 8;
        $number-of-stops   = ($width - $before-first-stop) div 8;
        $after-last-stop   = ($width - $before-first-stop) % 8;
    }
    [" " ** {$before-first-stop} || " " ** {^$before-first-stop} \t]
    [" " ** 8 || " " ** 0..7 \t] ** {$number-of-stops}
     " " ** {$after-last-stop}
}

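Explanation:

Same principle as before, except that we now first need to match as many spaces as it takes to get from the current position in the string to the first tab stop after it.
The current position in the string is given by $/.from; the rest is simple arithmetic.
Some lexical variables (hopefully with descriptive names) are declared and used inside the token to make the code easier to follow.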
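
The "simple arithmetic" can be worked through outside the regex. The sketch below uses a hypothetical helper, indent-parts (not part of the answer), which takes the start position as an explicit argument instead of reading it from $/.from and returns the three values the token computes.

```raku
# Hypothetical helper: the same arithmetic as in the token, with the start
# position passed in explicitly instead of being read from $/.from.
sub indent-parts(Int $from, Int $width) {
    my $before-first-stop = min $width, 8 - $from % 8;
    my $number-of-stops   = ($width - $before-first-stop) div 8;
    my $after-last-stop   = ($width - $before-first-stop) % 8;
    return $before-first-stop, $number-of-stops, $after-last-stop;
}

say indent-parts(0, 12);   # (8 0 4)  start of line: 8 to the first stop, then 4 spaces
say indent-parts(3,  9);   # (5 0 4)  from offset 3: 5 to reach column 8, then 4 spaces
say indent-parts(3, 25);   # (5 2 4)  5 to column 8, two full stops to 24, then 4 spaces
```

One caveat worth keeping in mind: $/.from is a character offset into the string, so it equals the visual column only when no tab precedes that position.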
