Regex pattern to parse search terms with search syntax

Question

I'm writing a search term parser that classifies search tokens for later post-processing. So far I have this pattern:

/([+])?([\-])?(\"([^\"]+)?\"?|([^\s]+)?|([^*]+)?)([\s])?/

Given a sample search string such as:

c++ +this -only this* +"is a very" "complex example"

I would like to get the following result:

G1   G2    G3                 G4                G5     G6   G7
           c++                                  c++         [space]
+          +this                                this        [space]
     -     -only                                only        [space]
           this*                                this   *    [space]
+          "is a very"        is a very                     [space]
           "complex example"  complex example               [space]

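The result I get is almost identical to the matches above, except that for the term this*, group 5 contains this* instead of this, and group 6 stays empty.

I know the ... ([^\s]+)?|([^*]+)?) ... part is incorrect, but I don't know how to rephrase it. I tried several approaches, such as swapping the sub-patterns, but could not find a good solution. I'd be glad if someone could give me a hint on how to solve this, and perhaps on how to make the term-matching part more efficient.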

Here is my test script:

<?php
$s = "c++ +this -only this* +\"is a very\" \"complex example\"";
$rc = preg_match_all(
        "/([+])?([\-])?(\"([^\"]+)?\"?|([^\s]+)?|([^*]+)?)([\s])?/",
    $s,
    $m);

print_r($m);
?>

Thanks a lot!

Answer 1

I'm not sure why you distinguish between G1 and G2. Here is a working pattern:

([-+]?)("([^"]+)"|([^\s*]+)(\*?))(\s)?

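Your pattern fails because of ([^\s]+)?|([^*]+)?). Alternation is tried from left to right, and [^\s]+ also accepts the * character, so a term such as this* is matched completely by the first alternative and the second alternative is never tried.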
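To make the ordering issue concrete, here is a minimal sketch that reduces both patterns to the part that matches a single word. The anchored, simplified patterns and the variable names $before and $after are illustrative assumptions, not taken from the question or the answer.

```php
<?php
// Alternation is ordered: the first branch that can match wins.
$term = 'this*';

// Simplified from the question's pattern: [^\s]+ also accepts '*',
// so it consumes the whole term and the second branch is never tried.
preg_match('/^(?:([^\s]+)|([^*]+))/', $term, $before);

// Excluding '*' from the word class and capturing it separately
// splits the term into the word and its optional wildcard.
preg_match('/^([^\s*]+)(\*?)/', $term, $after);

var_dump($before[1]);        // the whole term, including the '*'
var_dump(isset($before[2])); // the second branch did not participate
var_dump($after[1], $after[2]);
```

Because the negated class [^\s*] stops at the asterisk, the trailing \*? gets a chance to capture it on its own.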

The PHP implementation would be:

$re = "~([-+]?)(\"([^\"]+)\"|([^\s*]+)(\*?))(\s)?~";
$str = "c++ +this -only this* +\"is a very\" \"complex example\"";
preg_match_all($re, $str, $matches);

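The downside of this pattern is that every word without a trailing * ends up with an empty G5 (G6 in the question's table). You could use a lookahead to handle that edge case, but I wouldn't worry too much about it.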
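Since the goal is to classify the tokens for later post-processing, the empty groups are easy to absorb in that step. The following is only a sketch under assumptions: classify_terms and the keys modifier, text, phrase and wildcard are hypothetical names chosen for illustration, and it relies on PREG_SET_ORDER so that each term arrives as one array.

```php
<?php
// Hypothetical helper (not part of the answer): turn the raw matches
// into one record per search term.
function classify_terms(string $input): array
{
    // Same pattern as in the answer, written as a single-quoted string.
    $re = '~([-+]?)("([^"]+)"|([^\s*]+)(\*?))(\s)?~';
    preg_match_all($re, $input, $matches, PREG_SET_ORDER);

    $terms = [];
    foreach ($matches as $m) {
        // Trailing groups that did not participate are dropped from $m,
        // so missing indexes are treated as empty strings.
        $quoted = ($m[3] ?? '') !== '';
        $terms[] = [
            'modifier' => $m[1],                        // '+', '-' or ''
            'text'     => $quoted ? $m[3] : $m[4],      // term without quotes or '*'
            'phrase'   => $quoted,                      // true for "quoted phrases"
            'wildcard' => ($m[5] ?? '') === '*',        // true for a trailing '*'
        ];
    }
    return $terms;
}

print_r(classify_terms('c++ +this -only this* +"is a very" "complex example"'));
```

Reading the wildcard as a boolean means it no longer matters whether group 5 is empty or missing for a given term.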

regex

pcre

preg-match-all