Sub that returns a matched regex group

Question

I am parsing lines of the form

12:34 SomeEvent: 0 Lorem ipsum dolor sit amet

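I have a handler sub that takes a single line and uses given/when to pass it on to a more specific handler sub based on a regex match. For example, the line above would be passed to the _someevent sub.

In these specific handler subs, I want to extract the 0 part of the line, which acts like an ID.

I wrote the following sub for this: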

sub _getid ($) { $_[0] =~ /\d+:\d+ \w+: (\d+)/ }

This sub seems to work when used like this:

say _getid("12:34 SomeEvent: 0 Lorem ipsum dolor sit amet\n");

But when I assign the result to a variable:

my $id = _getid("12:34 SomeEvent: 0 Lorem ipsum dolor sit amet\n");
say "ID = $id";

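it always comes out as 1. I believe this has something to do with the fact that the =~ regex match actually returns a list or something, and I am assigning it to a scalar...?

I came up with the following:

sub _getid ($) {
    $_[0] =~ /\d+:\d+ \w+: (\d+)/;
    $1; # or return $1;
}

But there must be a better, more elegant way to solve this.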

Answer 1

You have been burned by context. From perlop (specifically, the section on Regexp Quote-Like Operators):

/PATTERN/msixpodualngc

Searches a string for a pattern match, and in scalar context returns true if it succeeds, false if it fails.

And later:

Matching in list context

If the /g option is not used, m// in list context returns a list consisting of the subexpressions matched by the parentheses in the pattern, that is, ($1, $2, $3...) (Note that here $1 etc. are also set). When there are no parentheses in the pattern, the return value is the list (1) for success. With or without parentheses, an empty list is returned upon failure.

Turning to your code.

say _getid("12:34 SomeEvent: 0 Lorem ipsum dolor sit amet\n");

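say() imposes list context on its arguments, so you get the list of captures. You have only one capture, so the list has a single element (your ID), and that is what gets printed.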

my $id = _getid("12:34 SomeEvent: 0 Lorem ipsum dolor sit amet\n");

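Assigning to a scalar variable is a very clear example of scalar context, so you get the behaviour described in the first extract from the documentation. The "1" you see is simply a true value.

[Update: My explanation of the problem (everything above this point) is fine. But my suggested fix (everything below this point) is not as useful as I originally thought. The other answers, from TLP and ikegami, both include better solutions.]

To fix this, you need to impose list context on the subroutine call. The simplest way is to replace the scalar assignment with a list assignment by putting parentheses around the variable.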

my ($id) = _getid("12:34 SomeEvent: 0 Lorem ipsum dolor sit amet\n");
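The difference between the two contexts can be seen side by side in a minimal sketch. The helper name getid_demo, the variable names, and the sample ID 7 are made up for illustration and are not from the question:

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: its last expression is the match itself, so the
# match is evaluated in whatever context the sub was called in.
sub getid_demo { $_[0] =~ /\d+:\d+ \w+: (\d+)/ }

my $line = "12:34 SomeEvent: 7 Lorem ipsum dolor sit amet\n";

my $scalar = getid_demo($line);    # scalar context: true/false
my ($list) = getid_demo($line);    # list context: first capture
my @all    = getid_demo($line);    # list context: all captures

say "scalar assignment: $scalar";
say "list assignment:   $list";
say "captures returned: " . scalar(@all);
```

Only the parentheses on the left-hand side differ between the first two assignments, which is why the bug is easy to miss.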

Answer 2

One elegant(?) way to make sure the subroutine always returns a scalar is to use a subscript on the list returned by the regex match:

sub _getid {
    ($_[0] =~ /\d+:\d+ \w+: (\d+)/)[0];    # subscript makes parenthesis return
                                           # 1st element of list
}

Of course, this is all very code-"golfy". I would probably write the subroutine more explicitly, so that the code is actually readable for others:

sub _getid {
    my $str = shift;
    my ($return) = $str =~ /\d+:\d+ \w+: (\d+)/;
    return $return;
}

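A few notes on your code.

Note that when you use $_[0], you may inadvertently change the argument, because you are accessing it directly. A safer option is to copy the contents into a new, lexically scoped variable, as in my example above.

Consider, for example, sub foo { $_[0]++ }. If you run my $foo = 0; foo($foo); print $foo; it prints 1, showing that $foo was changed by the subroutine. If you try foo(2), you also get the rather odd error Modification of a read-only value attempted.

You should probably not use prototypes for your subs. They have a special purpose in Perl, which is different from what most people think. I.e., you should write sub foo { ... } rather than sub foo ($) { ... }. See the Prototypes section of perlsub.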
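A short sketch of the aliasing behaviour described above. The sub names bump_alias and bump_copy are hypothetical, chosen only for this illustration:

```perl
use strict;
use warnings;
use feature 'say';

# Works directly on $_[0], which is an alias to the caller's argument.
sub bump_alias { $_[0]++ }

# Copies the argument first, so the caller's variable is left alone.
sub bump_copy { my $n = shift; $n++; return $n }

my $x = 0;
bump_alias($x);
say "after bump_alias: $x";

my $y = 0;
bump_copy($y);
say "after bump_copy:  $y";

# Passing a literal to the aliasing version dies at run time,
# so the call is wrapped in eval to show the message.
my $ok = eval { bump_alias(2); 1 };
say "error: $@" unless $ok;
```

Copying with shift (or my ($n) = @_;) costs one assignment and removes this whole class of surprises.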

Answer 3

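The code works as designed, just not as the OP expected.

The first error is hidden in the match pattern, which does not account for the : after SomeEvent.

The result of a match in scalar context indicates whether there was a match; think of it as a boolean.

With the /g modifier and several matches in the string, a match in list context returns all of the matches, and assigning that list in scalar context (for example my $count = () = $str =~ /\d+/g;) gives the number of matches.

If the left-hand side of the assignment is a list (or array), it is filled with the captured groups, but the original code did not use this approach.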
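The context rules above, including how a match count can be obtained with /g, can be sketched as follows. The sample string and variable names are invented for the example:

```perl
use strict;
use warnings;
use feature 'say';

my $text = "a1 b22 c333";            # made-up sample string

my $found = $text =~ /\d+/;          # scalar context: true/false
my @nums  = $text =~ /(\d+)/g;       # list context with /g: every match
my $count = () = $text =~ /\d+/g;    # list assignment in scalar context: count

say "found: $found";
say "nums:  @nums";
say "count: $count";
```

The empty list in the last assignment forces the match into list context, and the assignment as a whole then yields the number of elements on its right-hand side.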

What the OP should do is demonstrated in the modified version of the _getid() subroutine below.

use strict;
use warnings;
use feature 'say';

my $str = "12:34 SomeEvent: 0 Lorem ipsum dolor sit amet\n";
my $var;

$var = $str =~ /\d+:\d+ \w+ (\d+)/;
say "-[$var]-";

$var = $str =~ /\d+:\d+ \w+: (\d+)/;
say "-[$var]-";

my $id = _getid($str);
say '_getid returned: ' . $id;

sub _getid {
    my $str = shift;
    
    return $1 if $str =~ /\d+:\d+ \w+: (\d+)/;
    
    return undef;
}

Output

-[]-
-[1]-
_getid returned: 0

Documentation: perlre

Answer 4

You have this:

sub _getid ($) {
    $_[0] =~ /\d+:\d+ \w+: (\d+)/;
    $1; # or return $1;
}

The above fails if the string does not match: the sub then returns some "random" string, because $1 still holds the value from the last successful match. The following also work, but fail more safely:

# Match in scalar context returns whether the match succeeded or not.
# Returns $1, or undef if no match.
sub _getid { $_[0] =~ /\d+:\d+ \w+: (\d+)/ ? $1 : undef }

# Match in list context returns captures.
# Using a slice, this returns $1, or undef if no match.
sub _getid { ( $_[0] =~ /\d+:\d+ \w+: (\d+)/ )[0] }
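To make the failure mode concrete, here is a small sketch comparing the two styles on a line that does not match. The names getid_unsafe and getid_safe and the input strings are hypothetical; the sketch assumes an unrelated match succeeded earlier in the calling code:

```perl
use strict;
use warnings;
use feature 'say';

# Mirrors the question's workaround: returns $1 whether or not the match succeeded.
sub getid_unsafe { $_[0] =~ /\d+:\d+ \w+: (\d+)/; $1 }

# Slice version: an empty list on failure becomes undef in scalar context.
sub getid_safe { ( $_[0] =~ /\d+:\d+ \w+: (\d+)/ )[0] }

# Some earlier, unrelated successful match that sets $1.
"leftover" =~ /(left)/ or die "setup match failed";

my $bad  = getid_unsafe("no id on this line");
my $good = getid_safe("no id on this line");

say "unsafe: ", defined $bad  ? "'$bad'"  : "undef";
say "safe:   ", defined $good ? "'$good'" : "undef";
```

A failed match leaves the capture variables untouched, so the unsafe version hands back whatever the previous successful match captured.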
