PHP Intl Spoofchecker::isSuspicious() yields false positive

Case

It seems that Spoofchecker from the Intl extension yields a false positive:

<?php  // 7.0 on linux
// File encoding of this script is UTF-8 (thus without BOM)
$sDefaultLocale = (new \Locale)->getDefault();
$oSpoofchecker = new \Spoofchecker;
$oSpoofchecker->setAllowedLocales($sDefaultLocale);
$sText = 'abc';  // US-ASCII
header('Content-Type: text/plain');
print
    'Default locale: ' . $sDefaultLocale . PHP_EOL
  . 'Byte length: ' . strlen($sText) . PHP_EOL  // US-ASCII check
  . 'Text "' . $sText . '" '
  . ($oSpoofchecker->isSuspicious($sText, $sError) ? 'IS' : 'IS NOT')
  . ' suspicious' . PHP_EOL
  . 'Spoofchecker internal error information:' . PHP_EOL;
var_dump($sError);

Result

Default locale: en_US_POSIX
Byte length: 3
Text "abc" IS suspicious
Spoofchecker internal error information:
NULL    

Expected result

Text "abc" IS NOT suspicious

This is because abc is US-ASCII, which is presumably the default character set for en_US_POSIX. In addition, the PHP Spoofchecker class documentation mentions that Spoofchecker::isSuspicious() returns TRUE if any non-English characters are used, which is not the case here.

Possible cause

The documentation of Spoofchecker::setAllowedLocales() is currently close to non-existent; the argument list does not contain a list of possible values. One can only assume that it must be compatible with that of Locale, whose documentation states:

Locales are identified using RFC 4646 language tags (which use hyphen, not underscore)

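This contradicts the test result, in which Locale reports the default locale with an underscore rather than a hyphen. However, running another test with $oSpoofchecker->setAllowedLocales('en-US'); leaves the result unchanged.

Question

How do I use Spoofchecker::isSuspicious() correctly?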

You can use Spoofchecker::setChecks(int $checks) to specify how the string is validated.

The $checks constants are listed in the Spoofchecker class documentation and described by a user in the comments.

You can use Spoofchecker::CHAR_LIMIT (or a combination of several constants, for example Spoofchecker::CHAR_LIMIT | Spoofchecker::INVISIBLE):

CHAR_LIMIT: Check that an identifier contains only characters from a specified set of acceptable characters.
INVISIBLE: Check an identifier for the presence of invisible characters, such as zero-width spaces, or character sequences that are likely not to display, such as multiple occurrences of the same non-spacing mark.

$sDefaultLocale = (new \Locale)->getDefault();
$oSpoofchecker = new \Spoofchecker;
$oSpoofchecker->setAllowedLocales($sDefaultLocale);
$oSpoofchecker->setChecks(\Spoofchecker::CHAR_LIMIT);
$sText = 'abc';  // US-ASCII
header('Content-Type: text/plain');
print
    'Default locale: ' . $sDefaultLocale . PHP_EOL
  . 'Byte length: ' . strlen($sText) . PHP_EOL  // US-ASCII check
  . 'Text "' . $sText . '" '
  . ($oSpoofchecker->isSuspicious($sText, $sError) ? 'IS' : 'IS NOT')
  . ' suspicious' . PHP_EOL
  . 'Spoofchecker internal error information:' . PHP_EOL;
var_dump($sError);

This will output:

Default locale: en_US_POSIX
Byte length: 3
Text "abc" IS NOT suspicious
Spoofchecker internal error information:
NULL

Using the example from the isSuspicious() documentation, the text Рaypal.com (whose first letter is Cyrillic), the method returns:

Text "Рaypal.com" IS suspicious 
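
To try several strings against a combined check mask, a minimal sketch could look like the following. The helper checkAll() is a hypothetical name introduced here for illustration, and 'en_US' is an assumed locale; the result for each string depends on the installed ICU version, so no output is shown.

```php
<?php
// Sketch only: checkAll() is a hypothetical helper, not part of the Intl extension.

/**
 * Runs isSuspicious() on each text and returns an array of [text => bool].
 */
function checkAll(Spoofchecker $checker, array $texts): array
{
    $results = [];
    foreach ($texts as $text) {
        $results[$text] = $checker->isSuspicious($text);
    }
    return $results;
}

if (extension_loaded('intl')) {
    $checker = new Spoofchecker();
    $checker->setAllowedLocales('en_US');  // assumed locale for this sketch
    // Bitwise OR combines several check constants into a single mask.
    $checker->setChecks(Spoofchecker::CHAR_LIMIT | Spoofchecker::INVISIBLE);

    // "\u{0420}" is the Cyrillic capital letter Er, which looks like a Latin "P".
    $results = checkAll($checker, ['abc', "\u{0420}aypal.com"]);

    foreach ($results as $text => $suspicious) {
        printf('Text "%s" %s suspicious' . PHP_EOL, $text, $suspicious ? 'IS' : 'IS NOT');
    }
}
```

Keeping the mask in one place makes it easy to add or drop individual checks while comparing results across systems.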

PHP's Intl extension is just a wrapper around ICU, whose Spoofchecker produces fewer false positives starting with ICU version 58.

From their bug tracker:

ICU 58 reflects the latest Unicode update, which deprecates the Whole-Script Confusables (WSC) check and Mixed-Script Confusables (MSC) check and is available at http://www.unicode.org/L2/L2016/16229-revising-uts-39-algorithm.pdf.

Under ICU 57, the checks (WSC and MSC) had the following pitfalls:

  1. They did not restrict themselves to the set of characters specified by SpoofChecker#setAllowedChars or SpoofChecker#setAllowedLocales.
  2. They did not correctly handle confusables containing multiple skeleton characters, like 'æ' to 'ae'.
  3. WSC exhibited a high false-positive rate, especially as more and more entries were being added to confusables.txt.
  4. All strings failing MSC also fail Restriction Level. (Your string, "goօgle", is an example.)

With these pitfalls in mind, WSC and MSC were removed from ICU 58.

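Emphasis mine. The WSC check is the one your string is failing. (Note that the string passes wherever the ICU version is 58.1 or later, because that check has been removed entirely.)

As for how to use Spoofchecker::isSuspicious() correctly:

  1. Upgrade ICU (a good idea in general), or
  2. Use Spoofchecker::setChecks() as described above and omit the WSC check Spoofchecker::WHOLE_SCRIPT_CONFUSABLE (which covers this case) and the MSC check Spoofchecker::MIXED_SCRIPT_CONFUSABLE (likewise removed in recent versions).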
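
The two options can be combined into one version-aware setup. The sketch below is an assumption-laden illustration, not an official recipe: needsConfusableWorkaround() is a hypothetical helper, the threshold of 58 is taken from the bug tracker quote, and the chosen checks (CHAR_LIMIT and INVISIBLE) are simply the ones mentioned earlier in this thread.

```php
<?php
// Sketch only: needsConfusableWorkaround() is a hypothetical helper, not part of Intl.

/**
 * Returns true when the given ICU version is older than 58, i.e. when the
 * WSC and MSC checks are still present according to the quoted tracker entry.
 */
function needsConfusableWorkaround(string $icuVersion): bool
{
    return version_compare($icuVersion, '58', '<');
}

if (extension_loaded('intl')) {
    $checker = new Spoofchecker();
    $checker->setAllowedLocales('en_US');  // assumed locale for this sketch

    // INTL_ICU_VERSION holds the ICU version the extension was built against.
    if (needsConfusableWorkaround(INTL_ICU_VERSION)) {
        // Replace the default checks with a mask that leaves out
        // WHOLE_SCRIPT_CONFUSABLE and MIXED_SCRIPT_CONFUSABLE.
        $checker->setChecks(Spoofchecker::CHAR_LIMIT | Spoofchecker::INVISIBLE);
    }

    var_dump($checker->isSuspicious('abc'));
}
```

On ICU 58 or later the default checks are left untouched, so the workaround only applies where the deprecated checks can still fire.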