preg_split: splitting a string according to a very specific pattern

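Regex/PHP n00b here. I am trying to use the PHP "preg_split" function...

I have strings that follow a very specific pattern, and I want to split them according to that pattern.

Example of a string: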

CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION

Desired result:

[0]CADAVRES
[1]FILM
[2]Canada : Québec
[3]Érik Canuel
[4]2009
[5]long métrage
[6]FICTION

Delimiters (in order of appearance):

" ["
"] ("
", "
", "
", "
") "

How do I write the regex correctly?

Here is what I tried:

<?php
$pattern = "/\s\[/\]\s\(/,\s/,\s/,\s/\)\s/";
$string = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
$keywords = preg_split($pattern, $string);
print_r($keywords);

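It doesn't work, and I don't understand what I'm doing wrong. Then again, I've only just started dealing with regex and PHP, so yeah... there are so many escape characters that I can't see straight.

Thanks a lot!

I managed to find a solution using preg_match_all: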

$input = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
preg_match_all("|[^-\[\](),/\s]+(?:(?: :)? [^-\[\](),/]+)?|", $input, $matches);
print_r($matches[0]);

Array
(
    [0] => CADAVRES
    [1] => FILM
    [2] => Canada : Québec
    [3] => Érik Canuel
    [4] => 2009
    [5] => long métrage
    [6] => FICTION
)

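The regex above treats a term as any run of characters other than square brackets, commas, parentheses, and the like. It also allows a term made of two words, optionally with a colon separator between them.

Here is an attempt with preg_match: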

$pattern = "/^([^\[]+)\[([^\]]+)\]\s+\(([^,]+),\s+([^,]+),\s+([^,]+),\s+([^,]+)\)\s+(.+)$/i";
$string = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
preg_match($pattern, $string, $keywords);
array_shift($keywords);
print_r($keywords);

Output:

Array
(
    [0] => CADAVRES 
    [1] => FILM
    [2] => Canada : Québec
    [3] => Érik Canuel
    [4] => 2009
    [5] => long métrage
    [6] => FICTION
)

Regex breakdown:

^   anchor to start of string
 (    begin capture group 1
  [^\[]+   one or more non-left bracket characters
        )   end capture group 1
         \[   literal left bracket
           (   begin capture group 2
            [^\]]+   one or more non-right bracket characters
                  )    end capture group 2
                   \]   literal right bracket
                     \s+    one or more spaces
                        \(    literal open parenthesis
                          (     open capture group 3
                           [^,]+   one or more non-comma characters
                                )     end capture group 3
                                 ,\s+     literal comma followed by one or more spaces
                                     ([^,]+),\s+([^,]+),\s+([^,]+)   repeats of the above
                                                                  \)   literal closing parenthesis
                                                                    \s+   one or more spaces
                                                                       (  begin capture group 7
                                                                        .+  everything else
                                                                           )  end capture group 7
                                                                            $   anchor to end of string
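
As a sketch of the same breakdown in runnable form, the pattern can be written with the x (extended) modifier so that each piece carries its own comment. The group names (title, type, place and so on) are labels I made up for illustration, and the trim() call is an addition to drop the trailing space that group 1 captures.

```php
<?php
// Same structure as the pattern above, spread out with the x modifier.
// The ~ delimiter is used so the comments can stay free of escaping worries.
$pattern = '~
    ^
    (?<title>    [^\[]+ )        # everything before the left bracket
    \[ (?<type>  [^\]]+ ) \]     # text between the square brackets
    \s+ \(
    (?<place>    [^,]+ ) ,\s+    # four comma-separated fields
    (?<director> [^,]+ ) ,\s+
    (?<year>     [^,]+ ) ,\s+
    (?<format>   [^,]+ )
    \) \s+
    (?<genre>    .+ )            # everything after the closing parenthesis
    $
~x';

$string = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";

$fields = [];
if (preg_match($pattern, $string, $m)) {
    // Keep only the named groups, trimmed of surrounding whitespace.
    foreach (['title', 'type', 'place', 'director', 'year', 'format', 'genre'] as $name) {
        $fields[$name] = trim($m[$name]);
    }
}
print_r($fields);
```

Named groups cost nothing at match time and make the result easier to read than numeric indexes, at the price of a longer pattern.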

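This assumes your structure is static, and it isn't particularly pretty, but on the other hand it should be robust to delimiters creeping into fields where they aren't supposed to be. A title containing a : or a , seems plausible, and it would break a "split on these delimiters anywhere" type of solution. Take this input: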

"Matrix:, Trilogy()   [FILM, reviewed: good]    (Canada() :   Québec  ,  \t Érik Canuel , ): 2009 ,   long ():():[][]métrage) FICTIO  , [(:N";

It parses correctly as:

Array
(
    [0] => Matrix:, Trilogy()   
    [1] => FILM, reviewed: good
    [2] => Canada() :   Québec  
    [3] => Érik Canuel 
    [4] => ): 2009 
    [5] => long ():():[][]métrage
    [6] => FICTIO  , [(:N
)

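Also, if the comma-separated region inside the parentheses can vary in length, you may want to extract and parse it first, before handling the rest of the string.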
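
A minimal sketch of that two-stage idea, assuming the record still has the title, bracketed type, parenthesized region and trailing genre in that order. The function name parse_record() and the array keys are hypothetical, not taken from the answers here.

```php
<?php
// Hypothetical helper: stage 1 isolates the parenthesized region,
// stage 2 splits it on commas, however many fields it holds.
function parse_record(string $line): ?array
{
    // (.*) is greedy, so the region runs up to the last ")" that is
    // followed by whitespace and some trailing text.
    if (!preg_match('/^([^\[]+)\[([^\]]+)\]\s+\((.*)\)\s+(.+)$/u', $line, $m)) {
        return null; // the line does not follow the expected layout
    }

    return [
        'title'   => trim($m[1]),
        'type'    => $m[2],
        'details' => preg_split('/\s*,\s*/u', trim($m[3])),
        'genre'   => $m[4],
    ];
}

// Three fields inside the parentheses instead of four.
$record = parse_record("CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009) FICTION");
print_r($record);
```

Splitting the inner region on commas gives up the robustness to stray commas inside those fields, so this only pays off when the number of fields really does vary.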

You can split using this regex:

([^\w:]\s[^\w:]?|\s[^\w:])

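It looks for a non-(word or :) character, followed by a space, followed by an optional non-(word or :) character; or a space followed by a non-(word or :) character. This matches all of the split patterns you want. In PHP (note that you need the u modifier to handle Unicode characters):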

$input = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
$keywords = preg_split('/([^\w:]\s[^\w:]?|\s[^\w:])/u', $input);
print_r($keywords);

Output:

Array
(
    [0] => CADAVRES
    [1] => FILM
    [2] => Canada : Québec
    [3] => Érik Canuel
    [4] => 2009
    [5] => long métrage
    [6] => FICTION
)
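
For comparison, here is a sketch that stays closest to the attempt in the question. That attempt chains the delimiter pieces with /, but a PHP pattern has a single pair of delimiters: the first unescaped / most likely ends the pattern and the rest is read as modifiers, which PHP rejects. Joining the same pieces with | (alternation) inside one pair of delimiters gives a working split. The literal delimiters below are the ones listed in the question, with plain spaces instead of \s.

```php
<?php
$input = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";

// One pattern, four alternatives: " [", "] (", ", " and ") ".
$keywords = preg_split('/ \[|\] \(|, |\) /', $input);
print_r($keywords);
```

This version splits on those delimiters wherever they occur, so it carries the same weakness as any "split anywhere" approach when a field itself contains ", " or ") ".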
