Non-ASCII characters in UTF-8 mode regular expression

Question


Although the PHP manual states:

"In UTF-8 mode, characters with values greater than 128 do not match any of the POSIX character classes."

why do Persian digits match \d or [[:digit:]] in "UTF-8 mode"?

Details

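In the comments on another question, an answerer mentioned that in regular expressions \d matches not only the ASCII digits 0 through 9 but also, for example, Persian digits (۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷).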

That question is tagged java, but the same behavior can be observed in PHP. With that in mind, I wrote the following test:

$string = 'I have ۳ apples and 5 oranges';
preg_match_all('/\d+/', $string, $capture);

The resulting array $capture contains a match on 5 only.

Turning on "UTF-8 mode" with the u modifier and running this:

$string = 'I have ۳ apples and 5 oranges';
preg_match_all('/\d+/u', $string, $capture);

The resulting $capture contains matches on both ۳ and 5.

Notes

This question concerns PHP 5.6.22 (the latest version at the time of writing).
Both tests were executed with the C locale explicitly set.
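
A minimal sketch of how both tests might be run with the locale pinned, assuming a UTF-8 encoded source file. The variable names $plain and $utf8 are illustrative and do not come from the original tests:

```php
<?php
// Pin the locale so locale-specific character tables cannot affect \d.
setlocale(LC_ALL, 'C');

$string = 'I have ۳ apples and 5 oranges';

// Without the u modifier the subject is processed byte by byte.
preg_match_all('/\d+/', $string, $plain);

// With the u modifier the pattern and subject are treated as UTF-8.
preg_match_all('/\d+/u', $string, $utf8);

var_dump($plain[0]); // the question reports a single match: "5"
var_dump($utf8[0]);  // the question reports two matches: "۳" and "5"
```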

Answer 1

Because the documentation is wrong. Unfortunately, this is not the only place where it is.

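PHP uses PCRE under the hood to implement its preg_* functions, so PCRE's documentation is the authoritative source here. PHP's documentation is based on PCRE's, but it looks like you have spotted yet another mistake.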

Here is what you can read in PCRE's docs:

By default, characters with values greater than 128 do not match any of the POSIX character classes. However, if the PCRE_UCP option is passed to pcre_compile(), some of the classes are changed so that Unicode character properties are used. This is achieved by replacing certain POSIX classes by other sequences, as follows:
[:alnum:]  becomes  \p{Xan}
[:alpha:]  becomes  \p{L}
[:blank:]  becomes  \h
[:digit:]  becomes  \p{Nd}
[:lower:]  becomes  \p{Ll}
[:space:]  becomes  \p{Xps}
[:upper:]  becomes  \p{Lu}
[:word:]   becomes  \p{Xwd}

If you dig further into PHP's documentation, you will find the following:

u (PCRE_UTF8)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.

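Unfortunately, that is a lie. The u modifier in PHP means PCRE_UTF8 | PCRE_UCP (UCP stands for Unicode Character Properties). The PCRE_UCP flag changes the meaning of \d, \w and the like, as you can see from the PCRE docs quoted above. Your tests confirm this.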
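
A short sketch of what this means in practice, assuming the default C locale and a UTF-8 encoded source file. If only ASCII digits are wanted while the u modifier stays on, one option is to write the range out explicitly, since an explicit range is not rewritten by PCRE_UCP. The variable names are illustrative:

```php
<?php
$string = 'I have ۳ apples and 5 oranges';

// \d is Unicode-aware here, so it also matches the Persian digit.
preg_match_all('/\d+/u', $string, $unicodeDigits);

// An explicit range is unaffected by PCRE_UCP and stays ASCII-only.
preg_match_all('/[0-9]+/u', $string, $asciiDigits);

// "na\xC3\xAFve" is "naïve" spelled with the UTF-8 bytes for U+00EF.
$word = "na\xC3\xAFve";

// With u, \w covers Unicode letters, so the whole word is one match.
preg_match_all('/\w+/u', $word, $unicodeWord);

// Without u (C locale), the non-ASCII bytes are not word characters.
preg_match_all('/\w+/', $word, $byteWord);

var_dump($unicodeDigits[0], $asciiDigits[0], $unicodeWord[0], $byteWord[0]);
```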

As a side note, don't infer the properties of one regex flavor from another. It doesn't always work (heh, even this chart forgot about the PCRE_UCP option).
