Java regex doesn't match outside of the ASCII range, behaves differently than Python regex

Question

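I want to filter strings from documents the same way sklearn's CountVectorizer does. It uses the following regular expression: (?u)\b\w\w+\b. This Java code should behave the same way:

Pattern regex = Pattern.compile("(?u)\\b\\w\\w+\\b");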
Matcher matcher = regex.matcher("this is the document.!? äöa m²");

while(matcher.find()) {
    String match = matcher.group();
    System.out.println(match);
}

But this does not produce the desired output, which is what Python gives:

this
is
the
document
äöa
m²

Instead it outputs:

this
is
the
document

How can I include non-ASCII characters the way the Python regex does?

Answer 1

One step remains: you need to specify that \w should also include Unicode characters. Pattern.UNICODE_CHARACTER_CLASS to the rescue:

    Pattern regex = Pattern.compile("(?u)\\b\\w\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS);
    // the second argument (the flag) is the only change
    Matcher matcher = regex.matcher("this is the document.!? äöa m²");

    while(matcher.find()) {
        String match = matcher.group();
        System.out.println(match);
    }
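A note on the inline flags, as a small sketch: in Java, (?u) turns on UNICODE_CASE, while the inline form of UNICODE_CHARACTER_CLASS is the upper-case (?U). The class name FlagComparison and the helper tokens below are made up for illustration; the sample string is the one from the question, written with Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagComparison {
    // Collects every match of the given regex in the input, in order.
    static List<String> tokens(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String doc = "this is the document.!? \u00e4\u00f6a m\u00b2";

        // (?u) is UNICODE_CASE in Java, so the word class stays ASCII-only:
        // [this, is, the, document]
        System.out.println(tokens("(?u)\\b\\w\\w+\\b", doc));

        // (?U) is UNICODE_CHARACTER_CLASS, so Unicode letters count as word characters:
        // [this, is, the, document, äöa]
        System.out.println(tokens("(?U)\\b\\w\\w+\\b", doc));
    }
}
```

The second list still lacks the last token of the Python output, because the superscript two is not a word character for Java even with the flag.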

Answer 2

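As Wiktor suggested in the comments, you can use (?U) to turn on the UNICODE_CHARACTER_CLASS flag. While this does allow äöa to match, it still doesn't match m². That's because, even with UNICODE_CHARACTER_CLASS, \w does not recognize ² as a valid alphanumeric character. As a replacement for \w, you can use [\pN\pL_]. This matches Unicode numbers (\pN) and Unicode letters (\pL), plus _. The \pN Unicode character class includes the \p{No} character class, which covers the superscripts ²³¹ from the Latin-1 Supplement block (Latin-1 punctuation and symbols). Alternatively, you can add the \p{No} Unicode character class to a character class that already contains \w. Note that Java accepts the brace-less form only for one-letter names such as \pN; a two-letter category has to be written with braces, as \p{No}. This means the following regexes correctly match your strings:

[\pN\pL_]{2,}         # Matches any Unicode number or letter, and underscore
(?U)[\w\p{No}]{2,}    # Uses UNICODE_CHARACTER_CLASS so that \w matches Unicode.
                      # Adds \p{No} to additionally match ²³¹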
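To try both patterns from Java, something like the following sketch should work. The class name WordPatternAlternatives, the constant names, and the helper tokens are hypothetical; note the doubled backslashes needed inside a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordPatternAlternatives {
    // Any run of two or more Unicode numbers, Unicode letters, or underscores.
    static final Pattern LETTERS_AND_NUMBERS =
            Pattern.compile("[\\pN\\pL_]{2,}");

    // Unicode-aware word class, extended with the "other number" category.
    static final Pattern WORD_PLUS_OTHER_NUMBER =
            Pattern.compile("(?U)[\\w\\p{No}]{2,}");

    // Collects every match of the pattern in the input, in order.
    static List<String> tokens(Pattern pattern, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Same sample as in the question, written with Unicode escapes.
        String doc = "this is the document.!? \u00e4\u00f6a m\u00b2";

        // Both print: [this, is, the, document, äöa, m²]
        System.out.println(tokens(LETTERS_AND_NUMBERS, doc));
        System.out.println(tokens(WORD_PLUS_OTHER_NUMBER, doc));
    }
}
```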

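So why doesn't \w match ² in Java when it does in Python?

Java's interpretation

Looking at OpenJDK 8-b132's Pattern implementation, we get the following information (I removed whatever was not relevant to answering the question):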

Unicode support

The following Predefined Character classes and POSIX character classes are in conformance with the recommendation of Annex C: Compatibility Properties of Unicode Regular Expression, when UNICODE_CHARACTER_CLASS flag is specified.

\w A word character: [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}]

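Great! Now we have a definition for \w when the (?U) flag is used. Plugging these Unicode character classes into this amazing tool will tell you exactly what each of them matches. Without making this post overly long, I'll just go ahead and tell you that none of the following classes matches ²: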

\p{Alpha}
\p{gc=Mn}
\p{gc=Me}
\p{gc=Mc}
\p{Digit}
\p{gc=Pc}
\p{IsJoin_Control}
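If you would rather check this locally than with an online tool, here is a minimal sketch that tests each class against ² under UNICODE_CHARACTER_CLASS. The class name SuperscriptTwoCheck and the helper matchesSuperscriptTwo are made up for this example.

```java
import java.util.regex.Pattern;

class SuperscriptTwoCheck {
    // The component classes of the Unicode-aware word class, as listed above.
    static final String[] WORD_COMPONENTS = {
        "\\p{Alpha}", "\\p{gc=Mn}", "\\p{gc=Me}", "\\p{gc=Mc}",
        "\\p{Digit}", "\\p{gc=Pc}", "\\p{IsJoin_Control}"
    };

    // True if the given class, compiled with UNICODE_CHARACTER_CLASS,
    // matches the single character U+00B2 (superscript two).
    static boolean matchesSuperscriptTwo(String charClass) {
        return Pattern.compile(charClass, Pattern.UNICODE_CHARACTER_CLASS)
                .matcher("\u00b2")
                .matches();
    }

    public static void main(String[] args) {
        for (String charClass : WORD_COMPONENTS) {
            // Each of these prints "false".
            System.out.println(charClass + " -> " + matchesSuperscriptTwo(charClass));
        }
        // The "other number" category is the one that contains it: prints "true".
        System.out.println("\\p{No} -> " + matchesSuperscriptTwo("\\p{No}"));
    }
}
```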

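Python's interpretation

So why does Python match ²³¹ when the u flag is used with \w? This one was hard to track down, but I dug into Python's source code (I used Python 3.6.5rc1 - 2018-03-13). After stripping away a lot of fluff about how things get called, this is basically what happens: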

\w is defined as CATEGORY_UNI_WORD, which is then prefixed with SRE_. SRE_CATEGORY_UNI_WORD calls SRE_UNI_IS_WORD(ch).
SRE_UNI_IS_WORD is defined as (SRE_UNI_IS_ALNUM(ch) || (ch) == '_').
SRE_UNI_IS_ALNUM calls Py_UNICODE_ISALNUM, which is defined as (Py_UNICODE_ISALPHA(ch) || Py_UNICODE_ISDECIMAL(ch) || Py_UNICODE_ISDIGIT(ch) || Py_UNICODE_ISNUMERIC(ch)).
One of the checks here is Py_UNICODE_ISDECIMAL(ch), a macro defined as #define Py_UNICODE_ISDECIMAL(ch) _PyUnicode_IsDecimalDigit(ch). It is worth tracing to see how these predicates work.

Now, let's look at the function _PyUnicode_IsDecimalDigit(ch):

int _PyUnicode_IsDecimalDigit(Py_UCS4 ch)
{
    if (_PyUnicode_ToDecimalDigit(ch) < 0)
        return 0;
    return 1;
}

As we can see, this function returns 0 if _PyUnicode_ToDecimalDigit(ch) < 0, and 1 otherwise. So what does _PyUnicode_ToDecimalDigit look like?

int _PyUnicode_ToDecimalDigit(Py_UCS4 ch)
{
    const _PyUnicode_TypeRecord *ctype = gettyperecord(ch);

    return (ctype->flags & DECIMAL_MASK) ? ctype->decimal : -1;
}

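OK, so basically: gettyperecord(ch) looks up the character's type record in Python's Unicode database tables. If the flags in that record have DECIMAL_MASK set, the function returns the character's decimal value, which is greater than or equal to 0; otherwise it returns -1.

Note that the mask is applied to the flags of the type record, not to the bits of the code point itself. For ² (U+00B2, general category No) the decimal flag is in fact not set, so Py_UNICODE_ISDECIMAL is false for it, just as '²'.isdecimal() is False in Python. What does succeed is one of the remaining checks in Py_UNICODE_ISALNUM: ² has a digit value of 2, so '²'.isdigit() and '²'.isnumeric() are both True. ² is therefore considered a valid Unicode alphanumeric character in Python, and \w with the u flag matches ².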
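For comparison, here is a rough Java analogue of the Python check (alphanumeric or underscore), as a sketch only. It is an approximation and an assumption on my part: it treats every character in the general categories Nd, Nl, and No as numeric, which is close to, but not guaranteed to be identical with, what Python derives from its own Unicode tables. The names PythonLikeWordChar and isPythonLikeWordChar are hypothetical.

```java
class PythonLikeWordChar {
    // Approximates Python's word test: letter, any kind of number, or underscore.
    // Assumption: the number categories Nd, Nl, No stand in for Python's
    // decimal / digit / numeric predicates.
    static boolean isPythonLikeWordChar(int codePoint) {
        if (codePoint == '_' || Character.isLetter(codePoint)) {
            return true;
        }
        switch (Character.getType(codePoint)) {
            case Character.DECIMAL_DIGIT_NUMBER: // Nd
            case Character.LETTER_NUMBER:        // Nl
            case Character.OTHER_NUMBER:         // No, the category of superscript two
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        int superscriptTwo = 0x00B2;

        // Java's own digit test only accepts category Nd: prints "false".
        System.out.println(Character.isDigit(superscriptTwo));
        // Superscript two is in category No: prints "true".
        System.out.println(Character.getType(superscriptTwo) == Character.OTHER_NUMBER);
        // The Python-like test accepts it: prints "true".
        System.out.println(isPythonLikeWordChar(superscriptTwo));
    }
}
```

As a regex, the same approximation is the [\pN\pL_] class suggested earlier in this answer.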
