PCRE_UTF8 Modifier Extremely Slow
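For some reason, simply adding the PCRE_UTF8 modifier to the regex passed to preg_match() makes execution roughly 10 times slower, even when no multibyte characters are involved. I don't understand why this happens or how best to reduce the time. The script I used to test is: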
$s = microtime(true);
for ($i = 0; $i < 1000; $i++) {
    preg_match('/ /u', str_repeat(' ', 50000), $match);
}
$e = microtime(true);
echo "u Modifier:\t".(($e-$s)/$i)."\n";
$s = microtime(true);
for ($i = 0; $i < 1000; $i++) {
    preg_match('/ /', str_repeat(' ', 50000), $match);
}
$e = microtime(true);
echo "No Modifier:\t".(($e-$s)/$i)."\n";
The results are:
u Modifier: 2.5037050247192E-5
No Modifier: 2.4969577789307E-6
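I searched online to see whether this is a known bug, but supposedly it is not a problem with PHP.
What causes this, and what is the best way to perform the match [1] faster?
[1] "The match" means any match. The example used here is only a minimal one and could obviously be matched in better ways.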
PCRE checks UTF validity before doing any other processing.
From the PCRE docs:
When the PCRE2_UTF option is set, the strings passed as patterns and subjects are (by default) checked for validity on entry to the relevant functions. If an invalid UTF string is passed, a negative error code is returned. The code unit offset to the offending character can be extracted from the match data block by calling pcre2_get_startchar(), which is used for this purpose after a UTF error.
...
The entire string is checked before any other processing takes place. In addition to checking the format of the string, there is a check to ensure that all code points lie in the range U+0 to U+10FFFF, excluding the surrogate area. The so-called "non-character" code points are not excluded because Unicode corrigendum #9 makes it clear that they should not be.
...
In some situations, you may already know that your strings are valid, and therefore want to skip these checks in order to improve performance, for example in the case of a long subject string that is being scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time, PCRE2 assumes that the pattern or subject it is given (respectively) contains only valid UTF code unit sequences.
(Note: these docs are quoted from PCRE2, but PCRE behaves the same way.)
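The up-front validity check described in the quoted docs can be observed from PHP. The following is only a small sketch: the subject string and variable names are made up for illustration. It relies on preg_match() returning false and preg_last_error() reporting PREG_BAD_UTF8_ERROR when a /u match is given a malformed subject.

```php
<?php
// Illustrative subject: "\xFF" can never appear in well-formed UTF-8.
$invalid = "abc\xFFdef";

// Without /u the subject is treated as raw bytes, so the match succeeds.
$plain = preg_match('/abc/', $invalid);

// With /u the subject is validated first, so the call fails even though
// "abc" sits at the very start of the string.
$utf = preg_match('/abc/u', $invalid);
$error = preg_last_error();

var_dump($plain);                          // int(1)
var_dump($utf);                            // bool(false)
var_dump($error === PREG_BAD_UTF8_ERROR);  // bool(true)
```

The /u call fails because of a byte that lies after the text the pattern would have matched, which is consistent with the subject being scanned before matching starts. That scan is the extra cost on a 50000-byte subject.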
Unfortunately, I don't think there is a way to set the PCRE2_NO_UTF_CHECK option from PHP.
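In any case, your benchmark needs more iterations to be meaningful. You should measure a few seconds of computation time to get a better picture of the impact of this feature.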
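One possible way to follow that advice is to loop until a minimum wall-clock time has passed instead of using a fixed 1000 iterations. This is a rough sketch, not a rigorous benchmark; the helper name bench_preg_match() and its parameters are hypothetical, and the 2-second budget is an arbitrary choice.

```php
<?php
// Hypothetical helper: average seconds per iteration, measured over at
// least $minSeconds of wall-clock time.
function bench_preg_match(string $pattern, int $length, float $minSeconds = 2.0): float
{
    $calls = 0;
    $start = microtime(true);
    do {
        // Run in batches so the clock is not read on every call.
        for ($i = 0; $i < 1000; $i++) {
            // The subject is rebuilt on every call, as in the original
            // script, so both variants pay the same str_repeat() cost.
            preg_match($pattern, str_repeat(' ', $length), $match);
        }
        $calls += 1000;
        $elapsed = microtime(true) - $start;
    } while ($elapsed < $minSeconds);

    return $elapsed / $calls;
}

printf("u Modifier:\t%.3e\n", bench_preg_match('/ /u', 50000));
printf("No Modifier:\t%.3e\n", bench_preg_match('/ /', 50000));
```

Because each figure is averaged over a couple of seconds, short-lived noise has less influence than it does in a run that lasts only a few milliseconds.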