Punctuation somehow breaks preg_match_all group capture

Question

Consider this function:

function Split_Sentence($string, $asalpha)
{
 preg_match_all("~(?<han>\p{Han}+)|(?<alpha>[a-z\d$asalpha]+)|(?<other>\S+)~ui", $string, $out);

 $res = []; // stays an empty array when nothing matches
 foreach($out as $group_key=>$group)
 {
   if(!is_numeric($group_key))
   {  
    // discard indexed groups 
    foreach($group as $i=>$v)
    { 
     if(mb_strlen($v))
     {   
      $res[$i]=['type'=>$group_key,'text'=>$v];
     }
    }
   }
  }
  
  ksort($res);
  return $res;
}

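(where $asalpha is a series of characters that should be matched as "alpha" no matter what)

This function parses a sentence and breaks it into groups of Han characters, alpha characters, or "other" characters.

Punctuation seems to break it, and I can't figure out why. Whenever punctuation is involved, the whole block that starts with the punctuation mark is matched as "other".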

For instance, "hello 中国朋友 你好and welcome" correctly returns

Array
(
    [0] => Array
        (
            [type] => alpha
            [text] => hello
        )

    [1] => Array
        (
            [type] => han
            [text] => 中国朋友
        )

    [2] => Array
        (
            [type] => han
            [text] => 你好
        )

    [3] => Array
        (
            [type] => alpha
            [text] => and
        )

    [4] => Array
        (
            [type] => alpha
            [text] => welcome
        )

)

But "hello 中国朋友，你好and welcome" returns

Array
(
    [0] => Array
        (
            [type] => alpha
            [text] => hello
        )

    [1] => Array
        (
            [type] => han
            [text] => 中国朋友
        )

    [2] => Array
        (
            [type] => other
            [text] => ，你好and
        )

    [3] => Array
        (
            [type] => alpha
            [text] => welcome
        )

)

What am I missing?

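Update: The problem seems to be that the "other" group uses \S+ instead of \S. Using \S partially solves it, but then every "other" character is captured individually. \S+, on the other hand, captures several "other" characters as one group, but it also swallows Han and alpha characters until it finds a space.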
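To make the trade-off in the update concrete, here is a minimal sketch of the \S variant. It assumes $asalpha is empty and uses a made-up input string; neither comes from the real data.

```php
<?php
// $asalpha is left out here for brevity (an assumption of this sketch).
$single = '~(?<han>\p{Han}+)|(?<alpha>[a-z\d]+)|(?<other>\S)~ui';

// With \S (no quantifier) every "other" character becomes its own match,
// so "?" and "!" are reported separately instead of as "?!".
preg_match_all($single, '你好?! ok', $m);
print_r($m[0]); // four matches: 你好, ?, !, ok
```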

Answer 1

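The comma is matched by \S+ because \S matches any character other than whitespace, and \S+ matches one or more non-whitespace characters. Neither the han nor the alpha branch can match at the comma, so the other branch starts there and then keeps going: it consumes every following character that \p{Han} could have matched, as well as every character that (?<alpha>[a-z\d$asalpha]+) could have matched, until it reaches whitespace.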
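The behaviour described above can be reproduced in isolation. This is a small sketch, assuming an empty $asalpha and starting the subject string right at the comma:

```php
<?php
// Original pattern with $asalpha left empty (an assumption for this demo).
$pattern = '~(?<han>\p{Han}+)|(?<alpha>[a-z\d]+)|(?<other>\S+)~ui';

// At the comma neither "han" nor "alpha" can match, so "other" takes over
// and \S+ runs on until the next whitespace character.
preg_match($pattern, '，你好and welcome', $m);
echo $m['other'], PHP_EOL; // ，你好and
```

The alternation is tried again only after a match ends, which is why 你好 and "and" never get a chance to be classified on their own.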

If you want to exclude \p{Han} and [a-z\d$asalpha]+ from \S, use

(?<han>\p{Han}+)|(?<alpha>[a-z\d$asalpha]+)|(?<other>[^\p{Han}a-z\d$asalpha\s]+)

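See this regex demo. [^\p{Han}a-z\d$asalpha\s]+ matches one or more characters other than Han characters, ASCII letters (the i flag makes a-z case-insensitive), digits, the extra $asalpha characters, and whitespace.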
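Putting the corrected pattern into a function could look like the sketch below. The name split_sentence(), the use of PREG_SET_ORDER, and the preg_quote() step are assumptions made for illustration, not part of the original code. preg_quote() is there only so that characters such as "]" or "-" in $asalpha stay literal inside the character class.

```php
<?php
// Sketch only: split_sentence() and its preg_quote() step are illustrative
// assumptions, not code from the original post.
function split_sentence(string $string, string $asalpha = ''): array
{
    // Escape $asalpha so it cannot close or alter the character class.
    $extra = preg_quote($asalpha, '~');

    // "other" now excludes Han characters, alpha characters and whitespace.
    $pattern = "~(?<han>\\p{Han}+)|(?<alpha>[a-z\\d{$extra}]+)"
             . "|(?<other>[^\\p{Han}a-z\\d{$extra}\\s]+)~ui";

    // PREG_SET_ORDER yields one entry per match, already in string order,
    // so no ksort() is needed afterwards.
    preg_match_all($pattern, $string, $matches, PREG_SET_ORDER);

    $res = [];
    foreach ($matches as $match) {
        foreach (['han', 'alpha', 'other'] as $type) {
            // Exactly one named group is non-empty for each match.
            if (isset($match[$type]) && $match[$type] !== '') {
                $res[] = ['type' => $type, 'text' => $match[$type]];
                break;
            }
        }
    }
    return $res;
}

print_r(split_sentence('hello 中国朋友，你好and welcome'));
```

With this pattern the comma should come back as its own "other" entry, while 你好 and "and" are classified as han and alpha respectively.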

php

regex

unicode

cjk