Different regex preg_match_all results in live test and my script

Question

I have the following string:

{ Author = {Smith, John and James, Paul and Hanks, Tom}, Title = {{Some title}}, Journal = {{Journal name text}}, Year = {{2022}}, Volume = {{10}}, Number = {{11}}, Month = {{DEC}}, Abstract = {{Abstract text abstract text, abstract. Abstract text - abstract text? Abstract text! Abstract text abstract text abstract text abstract text abstract text abstract text abstract text abstract text, abstract text. Abstract text abstract text abstract text abstract text abstract text.}}, DOI = {{10.3390/ijms19113496}}, Article-Number = {{1234}}, ISSN = {{1234-5678}}, ORCID-Numbers = {{}}, Unique-ID = {{ISI:1234567890}}, }

My goal is to get these values into an associative array. I am trying this regex:

/([a-zA-Z0-9\-\_]+)\s*=\s*(\{(.*)\}|\d{4})/

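I use preg_match_all with no extra arguments (just the regex, the input, and the output). It works fine in an online tester like this one, but in my .php script it does not return all the values, only some of them. In particular, Abstract and Author are somehow never matched. I tried changing the modifiers (I currently use U, which makes matching non-greedy by default), but that did not solve my problem. Any help is much appreciated.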

Answer 1

Change your pattern from:

/([a-zA-Z0-9\-\_]+)\s*=\s*(\{(.*)\}|\d{4})/

to

/([a-zA-Z0-9\-\_]+)\s*=\s*(\{[^}]+\}|\d{4})/

Or in code:

$s = '{Author = {Smith, John and James, Paul and Hanks, Tom}, Title = {{Some title}}, Journal = {{Journal name text}}, Year = {{2022}}, Volume = {{10}}, Number = {{11}}, Month = {{DEC}}, Abstract = {{Abstract text abstract text, abstract. Abstract text - abstract text? Abstract text! Abstract text abstract text abstract text abstract text abstract text abstract text abstract text abstract text, abstract text. Abstract text abstract text abstract text abstract text abstract text.}}, DOI = {{10.3390/ijms19113496}}, Article-Number = {{1234}}, ISSN = {{1234-5678}}, ORCID-Numbers = {{}}, Unique-ID = {{ISI:1234567890}}, }';
$p = '/(\b[-\w]+)\s*=\s*(\{([^}]+)\}|\d{4})/';

preg_match_all($p, $s, $m);
print_r($m);


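This gets you closer, but it needs more refinement. Basically, what is happening is that you match the first { with the last }, because .* matches anything "greedily", meaning it consumes as much as it possibly can.

You can get a result similar to \{[^}]+\} above by simply making the original non-greedy, that is \{(.*?)\} instead of \{(.*)\}, but I don't think it reads as well.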

Output

 ...
[1] => Array
    (
        [0] => Author
        [1] => Title
        [2] => Journal
 ...

[2] => Array
    (
        [0] => {Smith, John and James, Paul and Hanks, Tom}
        [1] => {{Some title} //<--- lost }
        [2] => {{Journal name text} //<--- lost }

The simplest thing here is to add a couple of optional braces in there, \{? and \}?, so that you at least collect the complete tag:

  //note the \{\{? and \}?\}
  $p = '/(\b[-\w]+)\s*=\s*(\{\{?([^}]+)\}?\}|\d{4})/';

This changes index 2 to:

[2] => Array
    (
        [0] => {Smith, John and James, Paul and Hanks, Tom}
        [1] => {{Some title}}
        [2] => {{Journal name text}}

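But since there is no example of the desired result, that is as far as I can go.

As an aside:

Another (non-regex) way to do it is to trim the {} off the ends, explode the string on "},", then loop over the pieces and split each one on =. It is a bit finicky about the format.

Like this: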

$s = '{Author = {Smith, John and James, Paul and Hanks, Tom}, Title = {{Some title}}, Journal = {{Journal name text}}, Year = {{2022}}, Volume = {{10}}, Number = {{11}}, Month = {{DEC}}, Abstract = {{Abstract text abstract text, abstract. Abstract text - abstract text? Abstract text! Abstract text abstract text abstract text abstract text abstract text abstract text abstract text abstract text, abstract text. Abstract text abstract text abstract text abstract text abstract text.}}, DOI = {{10.3390/ijms19113496}}, Article-Number = {{1234}}, ISSN = {{1234-5678}}, ORCID-Numbers = {{}}, Unique-ID = {{ISI:1234567890}}, }';

function f($s,$o=[]){$e=array_map(function($v)use(&$o){if(strlen($v))$o[]=preg_split("/\s*=\s*/",$v."}");},explode('},',trim($s,'}{')));return$o;}

print_r(f($s));

Output

Array
(
    [0] => Array
        (
            [0] => Author
            [1] => {Smith, John and James, Paul and Hanks, Tom}
        )

    [1] => Array
        (
            [0] =>  Title
            [1] => {{Some title}}
        )

    [2] => Array
        (
            [0] =>  Journal
            [1] => {{Journal name text}}
        )
   ...

Uncompressed version:

/* uncompressed */
function f($s, $o=[]){
    $e = array_map(
        function($v) use (&$o){
            if(strlen($v)) $o[] = preg_split("/\s*=\s*/", $v."}");
        },
        //could use preg_split for more flexibility  '/\s*\}\s*,\s*/'
        explode(
            '},',
            trim($s, '}{')
        )
    );
    return $o;
}

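This is not a "robust" solution, but if the format is always like the example, it may be good enough. It looks cool anyway. Its output format is slightly nicer, but you can fix up the regex version with array_combine($m[1], $m[2]).

You can also pass it an array and it will append to it, for example: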
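As a rough sketch of the array_combine idea, the snippet below runs the adjusted pattern on a shortened sample and builds the associative array the question asks for. The shortened string and the variable names $assoc and $clean are assumptions made for illustration; stripping the braces with trim is one possible clean-up, not something the original code does.

```php
<?php
// Shortened sample in the same format as the string from the question.
$s = '{Author = {Smith, John and James, Paul}, Title = {{Some title}}, Year = {{2022}}, }';
$p = '/(\b[-\w]+)\s*=\s*(\{\{?([^}]+)\}?\}|\d{4})/';

preg_match_all($p, $s, $m);

// $m[1] holds the field names, $m[2] the values including their braces.
$assoc = array_combine($m[1], $m[2]);

// Optional clean-up: drop the surrounding braces from every value.
// array_map keeps the string keys when it is given a single array.
$clean = array_map(function ($v) {
    return trim($v, '{}');
}, $assoc);

print_r($assoc);
print_r($clean);
```

Note that trim removes every leading and trailing brace, so it would not suit values that legitimately start or end with one.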

print_r(f($s,[["foo","{bar}"]]));

Output:

Array
(
[0] => Array
    (
        [0] => foo
        [1] => {bar}
    )

[1] => Array
    (
        [0] => Author
        [1] => {Smith, John and James, Paul and Hanks, Tom}
    )

Then, if you want other formats:

//$a holds the result of the previous example
$a = f($s,[["foo","{bar}"]]);

//get an array of keys  ['foo', 'Author']
print_r(array_column($a,0));

//get an array of values ['{bar}', '{Smith, John ...}']
print_r(array_column($a,1));

//get an array with keys=>values ['foo'=>'{bar}', 'Author'=>'{Smith, John ...}']
print_r(array_column($a,1,0));

You can of course bake that directly into the function's return.

Anyway, that was fun, enjoy.

Update

The regex (\{[^}]+\}|\d{4}) means:

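(...) capture group, captures everything enclosed between ( and )
\{ matches { literally
[^}]+ matches anything that is not }, one or more times
\} matches } literally
| or
\d{4} matches 0-9 exactly 4 times.

Basically, the problem with \{(.*)\} as opposed to \{[^}]+\} is that .* also matches } and {, and because it is greedy (there is no trailing ?, as in \{(.*?)\}) it matches as much as it possibly can. So given fname={foo}, lname={bar}, it matches everything from the first { to the last }, that is {foo}, lname={bar}. The regex with the "not }" class, however, stops at the first }, because [^}]+ cannot match the closing } in foo}; that brace is matched by \} instead, which completes the pattern. With (.*) the match runs on to the last } and captures everything between the first { and the last } in the string.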
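The difference is easy to see side by side. The following is a minimal sketch using the fname/lname string from the explanation above; the variable names are only illustrative.

```php
<?php
// Two-field input taken from the explanation above.
$s = 'fname={foo}, lname={bar}';

preg_match_all('/\{(.*)\}/', $s, $greedy);      // greedy dot
preg_match_all('/\{(.*?)\}/', $s, $lazy);       // lazy (non-greedy) dot
preg_match_all('/\{([^}]+)\}/', $s, $negated);  // negated character class

print_r($greedy[0]);  // one match:   {foo}, lname={bar}
print_r($lazy[0]);    // two matches: {foo} and {bar}
print_r($negated[0]); // two matches: {foo} and {bar}
```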

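About lexing

Nesting is really hard for regex. As I said in the comments, a lexer is better. What that involves is that, instead of matching one big pattern like /([a-zA-Z0-9\-\_]+)\s*=\s*(\{[^}]+\}|\d{4})/, you match small patterns like these: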

[
  '(?P<T_WORD>\w+)', //matches a-zA-Z0-9_
  '(?P<T_OPEN_BRACKET>\{)', //matches {
  '(?P<T_CLOSE_BRACKET>\})', //matches }
  '(?P<T_EQUAL>=)', //matches =
  '(?P<T_WHITESPACE>\s+)', //matches \r \n \t and spaces
  '(?P<T_EOF>\Z)', //matches end of string
];

You can join them together with an or (|):

  "(?P<T_WORD>\w+)|(?P<T_OPEN_BRACKET>\{)|(?P<T_CLOSE_BRACKET>\})|(?P<T_EQUAL>=)|(?P<T_WHITESPACE>\s+)|(?P<T_EOF>\Z)",

(?P<name>...) is a named capture group, which just makes things easier. Instead of only getting a match like this:

[
   1 => [ 0 => 'Title', 1 => ''],
]

you will also have this:

[
   1 => [ 0 => 'Title', 1 => ''],
   'T_WORD' => [ 0 => 'Title', 1 => '']
]

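That makes it easier to assign the token name back to the match.

Anyway, the goal at this stage is to (eventually) get an array of "tokens", or match names, like this one, e.g. for Title = {{Some title}}: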

  //token stream
 [
    'T_WORD' => 'Title',   //keyword
    'T_WHITESPACE' => ' ', //ignore
    'T_EQUAL' => '=',      //instruction to end key,
    'T_WHITESPACE' => ' ', //ignore
    'T_OPEN_BRACKET' => '{', //inc a counter for open brackets
    'T_OPEN_BRACKET' => '{', //inc a counter for open brackets
    'T_WORD' => 'Some',      //capture as value
    'T_WHITESPACE' => ' ',   //capture as value
    'T_WORD' => 'title',     //capture as value
    'T_CLOSE_BRACKET' => '}', //dec a counter for open brackets
    'T_CLOSE_BRACKET' => '}', //dec a counter for open brackets
   ]
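A token stream like this could be produced roughly as follows. This is only a sketch under a few assumptions: the function name tokenize is made up, T_EOF is left out because end of input is simply where the matches stop, and a hypothetical T_OTHER catch-all is added so that punctuation such as "," is not silently skipped. Each token is stored as a [name, text] pair because a PHP array cannot repeat the same key.

```php
<?php
// Token patterns from the list above, plus a hypothetical T_OTHER catch-all.
$patterns = [
    '(?P<T_WORD>\w+)',
    '(?P<T_OPEN_BRACKET>\{)',
    '(?P<T_CLOSE_BRACKET>\})',
    '(?P<T_EQUAL>=)',
    '(?P<T_WHITESPACE>\s+)',
    '(?P<T_OTHER>.)',
];
$regex = '/' . implode('|', $patterns) . '/s';

function tokenize($regex, $input)
{
    $tokens = [];
    preg_match_all($regex, $input, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        foreach ($match as $name => $text) {
            // Named groups appear under string keys; the non-empty one
            // is the token type that matched at this position.
            if (is_string($name) && $text !== '') {
                $tokens[] = [$name, $text];
                break;
            }
        }
    }
    return $tokens;
}

print_r(tokenize($regex, 'Title = {{Some title}}'));
```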

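This should be fairly straightforward, but the key difference is that in pure regex you cannot count the { and }, so you cannot validate the syntax of the string; it either matches or it does not.

With the lexer version you can count these things and act appropriately. This is because you can iterate over the token matches and "test" the string. For example, we can say:

A word followed by = is an attribute name. Anything inside one or two { must be closed by the same number of } as there were {, and anything between the { and } is the "information" we need. Ignore any space outside of a {} pair, and so on. It gives us the "granularity" we need to validate this type of data.

I mention this because even the example I gave you, /(\b[-\w]+)\s*=\s*(\{\{?([^}]+)\}?\}|\d{4})/, will fail on a string like this: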
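Those rules could be turned into code along the following lines. This is a hedged sketch, not the code from the linked repository: parseFields is a hypothetical helper, it uses a simplified token set (a brace, an equals sign, or a run of anything else), and it keeps nested braces as part of the value.

```php
<?php
// Hypothetical helper: applies the rules above to input shaped like
// "Name = {value}, Other = {{value}}".
function parseFields($input)
{
    // Small tokens: a brace, an equals sign, or a run of anything else.
    preg_match_all('/[{}=]|[^{}=]+/', $input, $m);

    $fields = [];
    $name = '';
    $value = '';
    $depth = 0;

    foreach ($m[0] as $token) {
        if ($token === '{') {
            if ($depth > 0) {
                $value .= $token;   // nested brace belongs to the value
            }
            $depth++;
        } elseif ($token === '}') {
            $depth--;
            if ($depth < 0) {
                throw new RuntimeException('Unexpected }');
            }
            if ($depth === 0) {     // outermost brace closed: value complete
                $fields[$name] = $value;
                $name = $value = '';
            } else {
                $value .= $token;
            }
        } elseif ($depth > 0) {
            $value .= $token;       // text or "=" inside braces
        } elseif ($token !== '=') {
            // Outside braces: remember the word as the pending name,
            // ignoring separators and whitespace.
            $word = trim($token, ", \t\r\n");
            if ($word !== '') {
                $name = $word;
            }
        }
    }
    if ($depth !== 0) {
        throw new RuntimeException("Missing } ($depth still open)");
    }
    return $fields;
}

print_r(parseFields('Author = {Smith, John and James, {Paul and Hanks}, Tom}, Title = {{Some title}}'));
```

Because the counter must return to zero, an input with a missing } raises an exception instead of quietly producing a plausible-looking match.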

 Author = {Smith, John and James, {Paul and Hanks}, Tom}

which returns the match

 Author 
{Smith, John and James, {Paul and Hanks}

As another example, this will cause a failure:

Title = {{Some title}, Journal = {{Journal name text}}

This will give matches like these:

Title 
Some title
//and
Journal 
Journal name text

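This looks correct, but it is not, because {{Some title} is missing a }. How you handle invalid syntax in the string is up to you, but in the regex version we have no control over it. I should mention that even a recursive regex ("match pairs of brackets") will fail here, returning something like:

{{Some title}, Journal = {{Journal name text}

But in the lexer version we can keep a counter: { +1, { +1, then the words Some title, then } -1, and we are left with 1 instead of 0. So in our code we know that we are missing a } where one should be.

Below are some examples of lexers I have written (there is even an empty one in there):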
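The counting itself takes only a few lines. Below is a minimal sketch, assuming a hypothetical helper named braceBalance that looks only at the two bracket tokens and reports how many braces are still open at the end of the input.

```php
<?php
// Hypothetical helper: returns the number of braces left open at the end
// of the input, or -1 as soon as a "}" appears with nothing open.
function braceBalance($input)
{
    preg_match_all(
        '/(?P<T_OPEN_BRACKET>\{)|(?P<T_CLOSE_BRACKET>\})/',
        $input,
        $matches,
        PREG_SET_ORDER
    );
    $depth = 0;
    foreach ($matches as $match) {
        if ($match[0] === '{') {
            $depth++;               // { +1
        } elseif (--$depth < 0) {   // } -1
            return -1;
        }
    }
    return $depth;
}

var_dump(braceBalance('Title = {{Some title}}')); // int(0): balanced
var_dump(braceBalance('Title = {{Some title}'));  // int(1): one } is missing
```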

https://github.com/ArtisticPhoenix/MISC/tree/master/Lexers

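A lexer (even a basic one) is harder to implement than a pure regex solution, but it will be easier to use and maintain in the future. I hope that makes sense as an explanation of the difference between matching and lexing.

Essentially, with one big complex pattern all the complexity is baked into the pattern, which makes it hard to change. With smaller patterns, the complexity comes from how they are parsed (the instructions in your code), which makes it easier to adjust for edge cases and the like.

Good luck!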


php

regex

string

bibtex