Regexp that splits a string but ignores a quoted delimiter

Question

I am writing a Perl program that needs to parse a table written in a Wiki markup language. The table syntax uses the pipe character '|' to separate columns.

| row 1 cell 1    |row 1 cell 2  | row 1 cell 3|
| row 2 cell 1    | row 2 cell 2 |row 2 cell 3|

A cell may contain zero or more hyperlinks, whose syntax is illustrated as follows:

[[wiki:path:to:page|Page Title]]   or
[[wiki:path:to:page]]

Note that a hyperlink may contain a pipe character. Here, however, it is "quoted" by the [[..]] brackets.

The hyperlink syntax cannot be nested.

To match and capture the first cell in each of these table rows,

| Potatoes [[path:to:potatoes]]           | Daisies           |
| Kiki fruit [[path:to:kiwi|Kiwi Fruit]]  |             Lemons|

I tried:

qr{\|                      # match literal pipe
    (.*?                   # non-greedy zero or more chars
        (?:\[\[.*?\]\])    # a hyperlink 
     .*?)                  # non-greedy zero or more chars
   \|}x                    # match terminating pipe

This worked; $1 contained the cell contents.

Then, to match

| Potatoes            | Daisies           |

I tried making the hyperlink optional:

qr{\|                      # match literal pipe
    (.*?                   # non-greedy zero or more chars
        (?:\[\[.*?\]\])?   # <-- OPTIONAL hyperlink 
     .*?)                  # non-greedy zero or more chars
   \|}x                    # match terminating pipe

This worked, but when parsing

| Kiki fruit [[path:to:kiwi|Kiwi Fruit]]  |             Lemons|

I got only

 Kiki fruit [[path:to:kiwi

Evidently, given the option, it decided to ignore the hyperlink pattern and treat the embedded pipe as a column delimiter.

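I am stuck here. And I still have not dealt with the possibility of a hyperlink occurring more than once in a cell, or with handing the trailing pipe back to serve as the leading pipe on the next iteration.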

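There is no need to use the regexp in Perl's split function; I could write the splitting loop myself if that is easier. I see many similar questions being asked, but none seems to deal with this problem closely enough.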

Answer 1

$ perl -MRegexp::Common -E '$_=shift; while (
  /\| # beginning pipe, and consume it
  (   # capture 1
    (?:  # inside the pipe we will do one of these:
      $RE{balanced}{-begin=>"[["}{-end=>"]]"} # something with balanced [[..]]
      |[^|] # or a character that is not a pipe
    )* # as many of those as necessary
  ) # end capture one
  (?=\|) # needs to go to the next pipe, but do not consume it so g works
  /xg
) { say $1 }' '| Kiki fruit [[path:to:kiwi|Kiwi Fruit]]  |             Lemons|'
 Kiki fruit [[path:to:kiwi|Kiwi Fruit]]  
             Lemons

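This seems to extract what you are looking for. However, I suspect you would be better off with a proper parser for this language. I would be surprised if there were nothing on CPAN, but even if there is not, writing a parser for this may still be the better choice, especially once you start getting more odd things in your tables that you need to handle.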
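Since the question states that hyperlinks cannot be nested, the same idea can probably be sketched in core Perl without Regexp::Common, by replacing the balanced-bracket pattern with a plain non-greedy [[...]] match. The following is only a minimal sketch under that assumption; split_row is a hypothetical helper name, not something from the question, and the sketch assumes every row ends with a closing pipe.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): return the cells of
# one table row. Assumes links are never nested and the row ends with a pipe.
sub split_row {
    my ($row) = @_;
    my @cells;
    while ($row =~ /
        \|                      # leading pipe, consumed
        (                       # capture 1: the cell content
          (?:
              \[\[ .*? \]\]     # a whole link, embedded pipes included
            | [^|]              # or any single character that is not a pipe
          )*
        )
        (?=\|)                  # closing pipe, left in place for the next match
    /xg) {
        push @cells, $1;
    }
    return @cells;
}

my @cells = split_row('| Kiki fruit [[path:to:kiwi|Kiwi Fruit]]  |             Lemons|');
print "<$_>\n" for @cells;
```

Because the link alternative is tried before the single-character alternative, a cell with several links is handled by the same loop, and the lookahead leaves each closing pipe available as the leading pipe of the next cell. A row with an unclosed [[ or without a final pipe may fall back to treating an embedded pipe as a delimiter, so malformed input would need extra handling.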
