Why does this regular expression not match the first result in PHP?

Here is my regular expression:

❰(❮\d+[\-\d]*❯)⦓([^⦔]*)⦔❱

Here is the test text (it works correctly in the online demo, which runs in JavaScript):

Nulla imperdiet ❰❮6❯⦓“Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse gravida consectetur mauris, eget ornare velit consequat vitae.⦔❱❰❮7❯⦓Morbi in quam id nulla facilisis vestibulum sit amet ornare est. Duis dolor erat, porttitor at eleifend congue, lacinia vitae est. Phasellus ac sem ut velit fermentum porta at sit amet neque.⦔❱❰❮8❯⦓Etiam in congue turpis. Cras volutpat est mauris. Nulla imperdiet libero vitae metus semper, sit amet dictum lectus placerat. Aenean at venenatis libero.⦔❱❰❮9-10❯⦓Aenean luctus at nibh eget scelerisque. Phasellus vel consequat dui, eu euismod lacus. Nam id tellus tincidunt, tristique quam eu, cursus nulla. Suspendisse ac nibh lacinia, tempus enim quis, elementum nulla. .⦔❱ eu euismod.

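But it doesn't work in PHP. That is, it doesn't retrieve the first match, i.e. the one running from ❰❮6❯⦓“ to vitae.⦔❱. Interestingly, if I remove the Unicode double quote character (“), it works fine, but with the quote present the first match is not found. Why is this, and how can I avoid it?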


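Explanation of the regex: I want to match the content between ⦓ and ⦔ if it is the only content between ❰ and ❱, not counting the number content in ❮ ❯.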

Matching example:

❰❮6❯⦓Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse gravida consectetur mauris, eget ornare velit consequat vitae.⦔❱

Non-matching example:

❰❮6❯⦓Lorem ipsum dolor sit amet, consectetur adipiscing elit.⦔ Suspendisse gravida consectetur mauris, eget ornare velit consequat vitae.❱


My PHP code:

<?php
$subject = "Nulla imperdiet ❰❮6❯⦓“Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse gravida consectetur mauris,
         eget ornare velit consequat vitae.⦔❱❰❮7❯⦓Morbi in quam id nulla facilisis vestibulum sit amet ornare est. Duis dolor erat, 
        porttitor at eleifend congue, lacinia vitae est. Phasellus ac sem ut velit fermentum porta at sit amet neque.⦔❱❰❮8❯⦓Etiam in congue turpis. 
        Cras volutpat est mauris. Nulla imperdiet libero vitae metus semper, sit amet dictum lectus placerat. Aenean at venenatis libero.⦔❱❰❮9-10❯⦓Aenean 
        luctus at nibh eget scelerisque. Phasellus vel consequat dui, eu euismod lacus. Nam id tellus tincidunt, tristique quam eu,
        cursus nulla. Suspendisse ac nibh lacinia, tempus enim quis, elementum nulla. .⦔❱ eu euismod.";


$pattern = '#❰(❮\d+[\-\d]*❯)⦓([^⦔]*)⦔❱#';
preg_match_all($pattern, $subject, $matches);
echo '<pre>';
print_r($matches);
echo '</pre>';    
?>

Output:

Array
(
    [0] => Array
        (
            [0] => ❰❮7❯⦓Morbi in quam id nulla facilisis vestibulum sit amet ornare est. Duis dolor erat, 
        porttitor at eleifend congue, lacinia vitae est. Phasellus ac sem ut velit fermentum porta at sit amet neque.⦔❱
            [1] => ❰❮8❯⦓Etiam in congue turpis. 
        Cras volutpat est mauris. Nulla imperdiet libero vitae metus semper, sit amet dictum lectus placerat. Aenean at venenatis libero.⦔❱
            [2] => ❰❮9-10❯⦓Aenean 
        luctus at nibh eget scelerisque. Phasellus vel consequat dui, eu euismod lacus. Nam id tellus tincidunt, tristique quam eu,
        cursus nulla. Suspendisse ac nibh lacinia, tempus enim quis, elementum nulla. .⦔❱
        )

    [1] => Array
        (
            [0] => ❮7❯
            [1] => ❮8❯
            [2] => ❮9-10❯
        )

    [2] => Array
        (
            [0] => Morbi in quam id nulla facilisis vestibulum sit amet ornare est. Duis dolor erat, 
        porttitor at eleifend congue, lacinia vitae est. Phasellus ac sem ut velit fermentum porta at sit amet neque.
            [1] => Etiam in congue turpis. 
        Cras volutpat est mauris. Nulla imperdiet libero vitae metus semper, sit amet dictum lectus placerat. Aenean at venenatis libero.
            [2] => Aenean 
        luctus at nibh eget scelerisque. Phasellus vel consequat dui, eu euismod lacus. Nam id tellus tincidunt, tristique quam eu,
        cursus nulla. Suspendisse ac nibh lacinia, tempus enim quis, elementum nulla. .
        )

)

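You are matching Unicode characters, but you haven't included the Unicode modifier (u), so PCRE treats both the pattern and the subject as sequences of bytes rather than UTF-8 characters. In that mode [^⦔] excludes each byte of ⦔ individually, and the “ character begins with one of those bytes, so the first match fails.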

From the manual:

u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.
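
To see why the “ character in particular trips the pattern, it helps to look at bytes: ⦔ (U+2994) and “ (U+201C) share the leading UTF-8 byte 0xE2, and without u the class [^⦔] rejects that byte. The following is a minimal sketch, assuming PHP 7 or later for the \u{...} escapes; the variable names are illustrative and not part of the question's code.

```php
<?php
// Build both characters from code points so the file encoding does not matter.
$closing = "\u{2994}"; // ⦔
$quote   = "\u{201C}"; // “

// UTF-8 bytes of each character, as hex.
$closingBytes = bin2hex($closing);
$quoteBytes   = bin2hex($quote);
echo $closingBytes, "\n"; // e2a694
echo $quoteBytes, "\n";   // e2809c

// Byte mode: the class excludes the bytes E2, A6 and 94 one by one,
// so the quote's first byte (E2) can never be consumed.
$byteMode = preg_match('#^[^' . $closing . ']+$#', $quote);

// UTF-8 mode: the class excludes only the single code point U+2994.
$utf8Mode = preg_match('#^[^' . $closing . ']+$#u', $quote);

var_dump($byteMode, $utf8Mode); // int(0) and int(1)
```

The rest of the sample text is plain ASCII, which is why only the bracket containing the quote is affected.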

To fix your problem, simply append u to your regex:

$pattern = '#❰(❮\d+[\-\d]*❯)⦓([^⦔]*)⦔❱#u';
// Add the unicode modifier            ^
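
As a quick check of the fix, the sketch below runs both versions of the pattern against a shortened version of the subject. It assumes the source file is saved as UTF-8; the shortened text and the names $withoutU, $withU, $m1 and $m2 are made up for illustration.

```php
<?php
// Shortened subject: the first bracket starts with the “ character.
$subject = "Nulla ❰❮6❯⦓“Lorem ipsum.⦔❱❰❮7❯⦓Morbi in quam.⦔❱ eu euismod.";

$withoutU = '#❰(❮\d+[\-\d]*❯)⦓([^⦔]*)⦔❱#';
$withU    = $withoutU . 'u'; // same pattern plus the Unicode modifier

$countWithoutU = preg_match_all($withoutU, $subject, $m1);
$countWithU    = preg_match_all($withU, $subject, $m2);

echo "without u: $countWithoutU match(es)\n"; // only the ❮7❯ bracket
echo "with u:    $countWithU match(es)\n";    // both brackets
print_r($m2[1]); // the ❮...❯ parts captured with the u modifier
```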