Non-ASCII characters in UTF-8 mode regular expressions
Question
Despite what the PHP manual states, why do Persian digits match \d or [[:digit:]] in "UTF-8 mode"?
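Details
In the comments on another question, an answerer mentioned that, in regular expressions, \d matches not only the ASCII digits 0 through 9 but also, for example, Persian digits (۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷).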
That question is tagged java, but the same behavior can be observed in PHP. With that in mind, I wrote the following "test":
$string = 'I have ۳ apples and 5 oranges';
preg_match_all('/\d+/', $string, $capture);
The resulting array $capture contains a match for 5 only.
Turning on "UTF-8 mode" with the u modifier and running this:
$string = 'I have ۳ apples and 5 oranges';
preg_match_all('/\d+/u', $string, $capture);
the resulting $capture contains matches for both ۳ and 5.
Notes
- This question concerns PHP 5.6.22 (the latest)
- Both tests were run with the C locale set explicitly.
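Answer
Because the documentation is wrong. Unfortunately, this is not the only place where it is.
PHP uses PCRE under the hood to implement its preg_* functions, so PCRE's documentation is the authoritative source here. PHP's documentation is based on PCRE's, but it looks like you have found yet another error.
Here is what you can read in PCRE's docs (emphasis mine):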
By default, characters with values greater than 128 do not match any of the POSIX character classes. However, if the PCRE_UCP option is passed to pcre_compile(), some of the classes are changed so that Unicode character properties are used. This is achieved by replacing certain POSIX classes by other sequences, as follows:
[:alnum:] becomes \p{Xan}
[:alpha:] becomes \p{L}
[:blank:] becomes \h
[:digit:] becomes \p{Nd}
[:lower:] becomes \p{Ll}
[:space:] becomes \p{Xps}
[:upper:] becomes \p{Lu}
[:word:] becomes \p{Xwd}
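To see the substitution in practice, here is a minimal PHP sketch. It assumes the script file is saved as UTF-8 and relies on PHP's u modifier enabling PCRE_UCP. The helper all_matches() is a hypothetical name introduced only for this illustration.

```php
<?php
// Pin the locale, as in the question's notes.
setlocale(LC_ALL, 'C');

$string = 'I have ۳ apples and 5 oranges';

// Hypothetical helper: return every full match of $pattern in $subject.
function all_matches($pattern, $subject)
{
    preg_match_all($pattern, $subject, $capture);
    return $capture[0];
}

$posix   = all_matches('/[[:digit:]]+/u', $string); // POSIX class
$unicode = all_matches('/\p{Nd}+/u', $string);      // Unicode property
$escape  = all_matches('/\d+/u', $string);          // shorthand escape

// With UCP in effect, all three should report the same matches.
var_dump($posix === $unicode, $escape === $unicode);
```

If the three arrays are identical, [[:digit:]] and \d are both behaving like \p{Nd}, which is what the table above describes.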
If you dig further into PHP's documentation, you will find the following:
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.
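Unfortunately, this is a lie. The u modifier in PHP means PCRE_UTF8 | PCRE_UCP (UCP stands for Unicode Character Properties). The PCRE_UCP flag changes the meaning of \d, \w, and so on, as you can see from the PCRE documentation quoted above. Your tests confirm this.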
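If you need UTF-8 mode but want ASCII digits only, one option is to spell the range out. The following is a rough sketch, assuming the file is saved as UTF-8 and the C locale is in effect. The variable names and the sample word are made up for illustration.

```php
<?php
setlocale(LC_ALL, 'C');

$string = 'I have ۳ apples and 5 oranges';

// An explicit range stays ASCII-only, even with the u modifier.
preg_match_all('/[0-9]+/u', $string, $ascii);

// \d with the u modifier accepts any Unicode decimal digit (Nd).
preg_match_all('/\d+/u', $string, $any);

// \w changes as well. "\xC3\xAF" is the UTF-8 encoding of U+00EF (i with diaeresis).
$word = "na\xC3\xAFve";
$withU    = preg_match('/^\w+$/u', $word); // the letter counts as a word character
$withoutU = preg_match('/^\w+$/', $word);  // byte-wise match, the two bytes are not word characters

var_dump($ascii[0], $any[0], $withU, $withoutU);
```

The explicit [0-9] class does not depend on PCRE_UCP, so it behaves the same with or without the u modifier.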
As a side note, don't infer the properties of one regex flavor from another. It doesn't always work (hey, even this chart forgets about the PCRE_UCP option).