preg_split: splitting a string according to a very specific pattern
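Regex/PHP n00b here. I'm trying to use the PHP "preg_split" function...
I have strings that follow a very specific pattern, and I want to split them according to that pattern.
Example string: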
CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION
Desired result:
[0]CADAVRES
[1]FILM
[2]Canada : Québec
[3]Érik Canuel
[4]2009
[5]long métrage
[6]FICTION
Delimiters (in order of appearance):
" ["
"] ("
", "
", "
", "
") "
How do I write the regex correctly?
Here's what I tried:
<?php
$pattern = "/\s\[/\]\s\(/,\s/,\s/,\s/\)\s/";
$string = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
$keywords = preg_split($pattern, $string);
print_r($keywords);
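It doesn't work, and I don't understand what I'm doing wrong. Then again, I've only just started working with regex and PHP, so yeah... there are so many escape characters that I can't see straight...
Thanks a lot!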
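As for why the attempt above fails: the pattern uses / as its delimiter, so PHP treats the second / as the end of the regex and tries to read everything after it as modifiers. That typically triggers an "Unknown modifier" warning and makes preg_split return false. One way to say "any of these delimiters" is to join them with | inside a single pair of delimiters. The sketch below is a minimal example, not a definitive fix: the function name splitRecord is hypothetical, and it assumes none of the delimiters ever appears inside a field.

```php
<?php
// Hypothetical helper: split on any one of the four delimiter shapes.
function splitRecord(string $record): array
{
    // Alternatives, in order: " [", "] (", ", ", ") ".
    // The u modifier treats the pattern and subject as UTF-8.
    $pattern = '/\s\[|\]\s\(|,\s|\)\s/u';
    $parts = preg_split($pattern, $record);

    // preg_split returns false on error (for example invalid UTF-8 input).
    return $parts === false ? [] : $parts;
}

print_r(splitRecord('CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION'));
```

Because this splits on the delimiters wherever they occur, a title containing ", " would produce an extra field.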
I managed to find a solution using preg_match_all:
$input = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
preg_match_all("|[^-\[\](),/\s]+(?:(?: :)? [^-\[\](),/]+)?|", $input, $matches);
print_r($matches[0]);
Array
(
[0] => CADAVRES
[1] => FILM
[2] => Canada : Québec
[3] => Érik Canuel
[4] => 2009
[5] => long métrage
[6] => FICTION
)
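The regex above treats a term as any run of characters other than brackets, commas, parentheses, and so on. It also allows terms of two (or more) words, possibly with a colon separator after the first word.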
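One thing worth checking before relying on this pattern: both character classes also exclude the hyphen, so a hyphenated title is cut at the hyphen. The sketch below wraps the same pattern in a hypothetical function, extractTerms, to make that easy to try. The second input string is invented for illustration.

```php
<?php
// Hypothetical wrapper around the preg_match_all pattern shown above.
function extractTerms(string $input): array
{
    // Same pattern, written in single quotes so PHP leaves the backslashes alone.
    $pattern = '|[^-\[\](),/\s]+(?:(?: :)? [^-\[\](),/]+)?|';
    preg_match_all($pattern, $input, $matches);

    // $matches[0] holds the full text of every match.
    return $matches[0];
}

print_r(extractTerms('CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION'));

// The hyphen is in the excluded set, so "SPIDER-MAN" comes back as two terms.
print_r(extractTerms('SPIDER-MAN [FILM]'));
```

If hyphenated terms matter for your data, removing the - from both classes is one option, at the cost of no longer treating it as a separator.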
Here's an attempt with preg_match:
$pattern = "/^([^\[]+)\[([^\]]+)\]\s+\(([^,]+),\s+([^,]+),\s+([^,]+),\s+([^,]+)\)\s+(.+)$/i";
$string = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
preg_match($pattern, $string, $keywords);
array_shift($keywords);
print_r($keywords);
Output:
Array
(
[0] => CADAVRES
[1] => FILM
[2] => Canada : Québec
[3] => Érik Canuel
[4] => 2009
[5] => long métrage
[6] => FICTION
)
Regex breakdown:
^ anchor to start of string
( begin capture group 1
[^\[]+ one or more non-left bracket characters
) end capture group 1
\[ literal left bracket
( begin capture group 2
[^\]]+ one or more non-right bracket characters
) end capture group 2
\] literal right bracket
\s+ one or more spaces
\( literal open parenthesis
( open capture group 3
[^,]+ one or more non-comma characters
) end capture group 3
,\s+ literal comma followed by one or more spaces
([^,]+),\s+([^,]+),\s+([^,]+) repeats of the above
\) literal closing parenthesis
\s+ one or more spaces
( begin capture group 7
.+ everything else
) end capture group 7
$ anchor to end of string
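The breakdown can also live inside the pattern itself. The sketch below is one possible rewrite, not the pattern used above: it adds the x modifier so whitespace and # comments are allowed in the pattern, and it uses named groups. The function name parseRecord and the group names (title, type, place, director, year, format, genre) are assumptions made for illustration. Note that the first group also captures the space before the [, so the sketch trims every field.

```php
<?php
// Hypothetical helper: same structure as the pattern above, self-documented.
function parseRecord(string $record): ?array
{
    // ~ is the delimiter here, so it must not appear inside the comments.
    $pattern = '~
        ^  (?<title>     [^\[]+ )         # text up to the opening square bracket
        \[ (?<type>      [^\]]+ ) \]      # text between the square brackets
        \s+ \(
            (?<place>    [^,]+ ) ,\s+     # four comma-separated fields
            (?<director> [^,]+ ) ,\s+
            (?<year>     [^,]+ ) ,\s+
            (?<format>   [^,]+ )
        \) \s+
        (?<genre> .+ ) $                  # the rest of the string
    ~xu';

    if (preg_match($pattern, $record, $m) !== 1) {
        return null; // the input does not have the expected shape
    }

    $fields = [];
    foreach (['title', 'type', 'place', 'director', 'year', 'format', 'genre'] as $name) {
        $fields[$name] = trim($m[$name]);
    }
    return $fields;
}

print_r(parseRecord('CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION'));
```

Named keys make the calling code less dependent on group order, which helps if a field is added later.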
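This assumes your structure is static, and it isn't particularly pretty, but on the other hand it should be robust against delimiters creeping into fields where they don't belong. For example, a : or , in the title seems plausible, and it would break "split on these delimiters anywhere" type solutions. For example,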
"Matrix:, Trilogy() [FILM, reviewed: good] (Canada() : Québec , \t Érik Canuel , ): 2009 , long ():():[][]métrage) FICTIO , [(:N";
parses correctly as:
Array
(
[0] => Matrix:, Trilogy()
[1] => FILM, reviewed: good
[2] => Canada() : Québec
[3] => Érik Canuel
[4] => ): 2009
[5] => long ():():[][]métrage
[6] => FICTIO , [(:N
)
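Also, if the comma-separated region inside the parentheses has a variable number of fields, you may want to extract and parse it first, then handle the rest of the string.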
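A minimal sketch of that two-step idea, assuming the fields inside the parentheses never contain a comma themselves. The function name parseFlexibleRecord is hypothetical.

```php
<?php
// Hypothetical helper: tolerate any number of fields inside the parentheses.
function parseFlexibleRecord(string $record): ?array
{
    // Step 1: isolate the title, the bracketed type, the whole
    // parenthesised region, and whatever follows it.
    $outer = '~^([^\[]+)\[([^\]]+)\]\s+\((.*)\)\s+(.+)$~u';
    if (preg_match($outer, $record, $m) !== 1) {
        return null; // unexpected shape
    }

    // Step 2: split the parenthesised region on commas, however many there are.
    $inner = preg_split('~\s*,\s*~u', trim($m[3]));
    if ($inner === false) {
        return null;
    }

    return array_merge([trim($m[1]), trim($m[2])], $inner, [trim($m[4])]);
}

print_r(parseFlexibleRecord('CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION'));
print_r(parseFlexibleRecord('A [B] (x, y) C'));
```

The greedy (.*) runs to the last closing parenthesis that is followed by whitespace, so parentheses inside the fields are tolerated, but a comma inside a field still produces an extra element.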
You can split with this regex:
([^\w:]\s[^\w:]?|\s[^\w:])
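It looks for a non-(word or :) character, followed by a whitespace character, followed by an optional non-(word or :) character; or a whitespace character followed by a non-(word or :) character. This matches all the split patterns you want. In PHP (note that you need the u modifier to handle Unicode characters):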
$input = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
$keywords = preg_split('/([^\w:]\s[^\w:]?|\s[^\w:])/u', $input);
print_r($keywords);
Output:
Array
(
[0] => CADAVRES
[1] => FILM
[2] => Canada : Québec
[3] => Érik Canuel
[4] => 2009
[5] => long métrage
[6] => FICTION
)
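To see why the u modifier matters here, it can help to run the same pattern with and without it. Without u, \w works on single bytes, so in a typical setup the first byte of the "É" in "Érik" counts as a non-word character and is swallowed by the optional third part of the delimiter. The sketch below is only an illustration. The function name splitOnPunctuation and its $unicode parameter are hypothetical, and the exact byte-level result without u may depend on the locale.

```php
<?php
// Hypothetical helper: the pattern above, with the u modifier made optional.
function splitOnPunctuation(string $input, bool $unicode = true): array
{
    $pattern = '/([^\w:]\s[^\w:]?|\s[^\w:])/' . ($unicode ? 'u' : '');
    $parts = preg_split($pattern, $input);

    // preg_split returns false on error, for example invalid UTF-8 with u.
    return $parts === false ? [] : $parts;
}

$input = 'CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION';

// With u: the seven expected fields.
print_r(splitOnPunctuation($input));

// Without u: dump the fourth field as hex to reveal any stray byte.
echo bin2hex(splitOnPunctuation($input, false)[3] ?? ''), "\n";
```

The parentheses around the whole pattern are harmless here, since captured delimiters are only returned when PREG_SPLIT_DELIM_CAPTURE is passed.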