Perl6 Parse File

Question

As an exercise, I'm trying to parse some standard text that is the output of a shell command.

  pool: thisPool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: none requested
config:

    NAME                                                STATE     READ WRITE CKSUM
    homePool                                            ONLINE       0     0     0
      mirror-0                                          ONLINE       0     0     0
        ata-WDC_WD5000AZLX-00CL5A0_WD-WCC3F7NUE93C      ONLINE       0     0     0
        ata-WDC_WD5000AZLX-00CL5A0_WD-WCC3F7RE2A4F      ONLINE       0     0     0
    cache
      ata-KINGSTON_SV300S37A60G_50026B7261025D7E-part3  ONLINE       0     0     0

errors: No known data errors

I want to use a Perl6 grammar, and I want to capture each field in a separate token or regex. So I wrote the following grammar:

grammar zpool {
        regex TOP { \s+ [ <keyword> <collection> ]+ }
        token keyword { "pool: " | "state: " | "status: " | "action: " | "scan: " | "config: " | "errors: " }
        regex collection { [<:!keyword>]*  }
}

My idea is that the regex finds a keyword and then collects all the data up to the next keyword. But every time, I just get "pool: " followed by all the remaining text.

 keyword => ｢pool: ｣
 collection => ｢homePool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: none requested
config:

    NAME                                                STATE     READ WRITE CKSUM
    homePool                                            ONLINE       0     0     0
      mirror-0                                          ONLINE       0     0     0
        ata-WDC_WD5000AZLX-00CL5A0_WD-WCC3F7NUE93C      ONLINE       0     0     0
        ata-WDC_WD5000AZLX-00CL5A0_WD-WCC3F7RE2A4F      ONLINE       0     0     0
    cache
      ata-KINGSTON_SV300S37A60G_50026B7261025D7E-part3  ONLINE       0     0     0

errors: No known data errors
｣

I don't know how to make it stop eating characters when it finds a keyword and then treat that as the next keyword.

Answer 1

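Problem 1

You've written <:!keyword> instead of <!keyword>. That's not what you want. You need to remove the :.

In P6 regexes, the <:foo> syntax matches a single character with the specified Unicode property, in this case the property :foo, which in turn means :foo(True).

And <:!keyword> matches a single character with the Unicode property :keyword(False).

But there is no Unicode property :keyword.

So the negated assertion is always true, and it matches a single character of input every time.

So, as you found, the pattern just munches its way through the rest of the text.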
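To make the `<:foo>` / `<:!foo>` behavior concrete, here is a minimal sketch that uses a real Unicode property, `Lu` (uppercase letter), in place of the nonexistent `keyword`. The variable names are made up for illustration.

```raku
# <:Lu> matches exactly one character that has the Unicode property Lu.
my $upper-matches-Lu     = so 'A' ~~ / ^ <:Lu>  $ /;
# <:!Lu> matches exactly one character that does NOT have that property.
my $lower-matches-not-Lu = so 'a' ~~ / ^ <:!Lu> $ /;
my $upper-matches-not-Lu = so 'A' ~~ / ^ <:!Lu> $ /;

say $upper-matches-Lu;      # True
say $lower-matches-not-Lu;  # True
say $upper-matches-not-Lu;  # False
```

Either way, one character of input is consumed per match, which is why the original collection regex kept eating text.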

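Problem 2

Once problem 1 is fixed, a second problem shows up.

<:!keyword> matches a single character with the Unicode property :keyword(False). It automatically chews up some input (a single character) every time it matches.

By contrast, <!keyword> consumes no input when it matches. You have to make sure the pattern that uses it consumes the input itself.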
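The zero-width nature of a negated assertion can be seen in a small sketch. The grammar `Demo` and its token names are hypothetical and only serve the illustration.

```raku
# A negated assertion succeeds without consuming anything.
my $zero-width = 'abc' ~~ / <!before 'x'> /;
say $zero-width.Str.chars;   # 0

grammar Demo {
    token TOP  { <body> <stop> }
    token stop { 'END' }
    # <!stop> only checks that 'END' does not start here;
    # the following . is what actually consumes one character.
    token body { [ <!stop> . ]* }
}

my $m = Demo.parse('abcEND');
say $m<body>.Str;   # abc
say $m<stop>.Str;   # END
```

Without the `.` after `<!stop>`, the loop body would match without advancing through the input.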

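With both problems fixed, you get the output you expected. (The next problem you'll see is that the config keyword doesn't work, because in your sample input the : in config: isn't followed by a space.)

So, after some cleanup: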

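my @keywords = <pool state status action scan config errors> ;

grammar zpool {
    # leading whitespace, then any number of keyword/collection pairs
    token TOP        { \s+ [ <keyword> <collection> ]* }
    # one of the known keywords, followed by a colon and a space
    token keyword    { @keywords ': ' }
    # consume one character at a time until the next keyword starts
    token collection { [ <!keyword> . ]* }
}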

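I've switched all the patterns to token declarations. In general, always use token unless you know you need something else. (regex enables backtracking, which can slow things down a lot if you're not careful. rule makes whitespace in the pattern significant.)

I've extracted the keywords into an array. @keywords means @keywords[0] | @keywords[1] | ....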
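As a quick check of the array interpolation, the following sketch compares an interpolated array with the equivalent hand-written alternation. The array `@words` is a shortened, made-up stand-in for `@keywords`.

```raku
my @words = <pool state>;

# An array in a regex behaves like an alternation of its elements.
my $via-array  = so 'state' ~~ / ^ @words $ /;
my $via-alt    = so 'state' ~~ / ^ [ 'pool' | 'state' ] $ /;
my $no-match   = so 'scan'  ~~ / ^ @words $ /;

say $via-array;  # True
say $via-alt;    # True
say $no-match;   # False
```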

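I've added a . after the <!keyword> in the last pattern, to consume one character of input and so avoid the infinite loop that would otherwise happen because <!foo> consumes no input.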
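For reference, here is one way the cleaned-up grammar might be run. This is a sketch on a shortened, made-up sample (`$sample`), not the full zpool output, and the loop over the captures is just one possible way to print them.

```raku
my @keywords = <pool state status action scan config errors>;

grammar zpool {
    token TOP        { \s+ [ <keyword> <collection> ]* }
    token keyword    { @keywords ': ' }
    token collection { [ <!keyword> . ]* }
}

# Shortened sample input (an assumption for this sketch).
my $sample = "  pool: thisPool\n state: ONLINE\nerrors: No known data errors\n";
my $match  = zpool.parse($sample);

# The quantified group makes <keyword> and <collection> lists of matches.
for ^$match<keyword>.elems -> $i {
    say $match<keyword>[$i].Str.trim, ' ', $match<collection>[$i].Str.trim;
}
```

Each collection now stops where the next keyword begins, instead of swallowing the rest of the text.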

If you haven't come across them yet, note that the available grammar debugging options are your friend.

Hth

Answer 2

As much as I like a good grammar now and then, this is much easier to solve with a call to split:

my $input = q:to/EOF/;
  pool: thisPool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: none requested
config:

    NAME                                                STATE     READ WRITE CKSUM
    homePool                                            ONLINE       0     0     0
      mirror-0                                          ONLINE       0     0     0
        ata-WDC_WD5000AZLX-00CL5A0_WD-WCC3F7NUE93C      ONLINE       0     0     0
        ata-WDC_WD5000AZLX-00CL5A0_WD-WCC3F7RE2A4F      ONLINE       0     0     0
    cache
      ata-KINGSTON_SV300S37A60G_50026B7261025D7E-part3  ONLINE       0     0     0

errors: No known data errors
EOF

my @delimiter = <pool state status action scan config errors>;
my %fields;
for $input.split( / ^^ \h* (@delimiter) ':' \h*/, :v)[1..*] -> $key, $value {
    %fields{ $key[0] } = $value.trim;
}

say %fields.perl;

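This works by splitting on the known keys, discarding the first element (because we know the input starts with a key rather than a value), and then iterating over keys and values in lockstep.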
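The shape of the list that split returns with :v is easier to see on a tiny input. The string and the `k1`/`k2` keys below are invented for this sketch.

```raku
# With :v, the separators (as Match objects) are kept between the pieces.
my @parts = "k1:a k2:b".split( / (k\d) ':' /, :v );
say @parts.elems;            # 5: "", separator, "a ", separator, "b"

my %demo;
# Skip the empty leading piece, then take separator and value two at a time.
for @parts[1..*] -> $key, $value {
    %demo{ $key[0] } = $value.trim;   # $key[0] is the captured key name
}
say %demo<k1>;   # a
say %demo<k2>;   # b
```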

Since you asked for a grammar, we can easily turn the split call into a pure regex by replacing each value with .+? (any string, but as short as possible).
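The effect of the frugal quantifier can be sketched on a one-line, made-up string (`$text`), comparing `.+?` with the greedy `.+`:

```raku
my $text = 'pool: a; state: b;';

# Frugal: stop at the first ';' that lets the overall match succeed.
my $frugal = $text ~~ / 'pool: ' (.+?) ';' /;
# Greedy: grab as much as possible, then back off to the last ';'.
my $greedy = $text ~~ / 'pool: ' (.+)  ';' /;

say $frugal[0].Str;   # a
say $greedy[0].Str;   # a; state: b
```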

Now let's give it some more structure:

my @delimiter = <pool state status action scan config errors>;
grammar ZPool {
    regex key      { @delimiter             }
    regex keychunk { ^^ \h* <key> ':'       }
    regex value    { .*?                    }
    regex chunks   { <keychunk> \h* <value> }
    regex TOP      { <chunks>+              }
}

We could do the work of extracting the results from the nested match tree, or cheat with a stateful actions object:

class ZPool::Actions {
    has $!last-key;
    has %.contents;
    method key($m)   { $!last-key = $m.Str                }
    method value($m) { %!contents{ $!last-key } = $m.trim }
}

Then use it:

my $actions = ZPool::Actions.new;
ZPool.parse($input, :$actions);
say $actions.contents.perl;

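key and keychunk don't need backtracking, so you can change them from regex to token.

Of course, using .+? and backtracking could be considered cheating, so you can use the trick raiph mentioned and put a negative lookahead in the value regex.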
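A minimal sketch of that variant might look like the following. The grammar name `ZPoolTokens`, the shortened `$sample`, and the extraction code are assumptions made for illustration. It reads the results from the match tree rather than using the stateful actions class, since the lookahead itself calls key.

```raku
my @delimiter = <pool state status action scan config errors>;

grammar ZPoolTokens {
    token TOP      { <chunk>+ }
    token chunk    { <keychunk> \h* <value> }
    token keychunk { ^^ \h* <key> ':' }
    token key      { @delimiter }
    # Take characters one at a time until the next key line starts.
    token value    { [ <!keychunk> . ]* }
}

# Shortened sample input (an assumption for this sketch).
my $sample = "  pool: thisPool\n state: ONLINE\nconfig:\n\n    NAME\nerrors: No known data errors\n";
my $m = ZPoolTokens.parse($sample);

# Walk the match tree: one key/value pair per chunk.
my %fields = $m<chunk>.map: -> $c {
    $c<keychunk><key>.Str => $c<value>.Str.trim
};
say %fields<pool>;     # thisPool
say %fields<config>;   # NAME
```

Because every pattern is a token, no backtracking is involved; the negative lookahead alone decides where each value ends.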
