How does regex.WORD affect the behavior of \b?

Question

I am using the PyPI module regex for regular expression matching. Its documentation says:

Default Unicode word boundary

The WORD flag changes the definition of a ‘word boundary’ to that of a default Unicode word boundary. This applies to \b and \B.

But nothing seems to change:

>>> import regex
>>> r1 = regex.compile(r".\b.", flags=regex.UNICODE)
>>> r2 = regex.compile(r".\b.", flags=regex.UNICODE | regex.WORD)
>>> r1.findall("русский  ελλανικα")
['й ', ' ε']
>>> r2.findall("русский  ελλανικα")
['й ', ' ε']

I am not seeing any difference...?

Answer 1

The difference between using and not using the WORD flag lies in how a word boundary is defined.

Take this example:

import regex

t = 'A number: 3.4 :)'

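# Default word boundary: prints a match object for '3'.
print(regex.search(r'\b3\b', t))
# Default Unicode word boundary: prints None.
print(regex.search(r'\b3\b', t, flags=regex.WORD))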

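The first call prints a match, while the second prints None. Why? The "Unicode word boundary" comes with a set of rules for deciding where words begin and end, whereas the default Python word boundary is simply the position between a \w character and a \W character (where \w still covers Unicode alphanumerics).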
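To make the default definition concrete, here is a minimal sketch that lists the offsets where \b matches in the sample string. It uses the standard library's re module, on the assumption that its \b follows the same \w/\W definition as regex without the WORD flag; the variable name default_boundaries is only illustrative.

```python
import re

t = 'A number: 3.4 :)'

# Offsets where the default \b matches: every transition between a
# \w character and a \W character (or a string edge next to a \w).
# Assumption: stdlib re behaves like regex without the WORD flag here.
default_boundaries = [m.start() for m in re.finditer(r'\b', t)]
print(default_boundaries)  # [0, 1, 2, 8, 10, 11, 12, 13]

# '3' sits at offset 10, so \b3\b needs boundaries at offsets 10 and 11.
print(10 in default_boundaries and 11 in default_boundaries)  # True
```

Offsets 11 and 12 are the two sides of the period, which is why the default \b3\b succeeds.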

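In the example, Python's default word boundary splits 3.4 because the period is a \W character, so there is a boundary on each side of it. Under the Unicode word boundary rules, one rule says not to break at a "." inside a sequence such as "3.4", so the positions around the period are not word boundaries and \b3\b cannot match.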
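The effect of that single rule can be imitated with a rough sketch. The helper boundaries_with_numeric_rule below is hypothetical and is not how the regex module works internally: it starts from the default \w/\W boundaries (computed with the standard library's re) and removes only those next to a "." or "," that sits between two digits. The real WORD flag applies the complete Unicode rule set, so its boundaries may differ elsewhere in the string.

```python
import re

def boundaries_with_numeric_rule(text):
    """Hypothetical helper: default word-boundary offsets minus those
    next to a '.' or ',' that sits between two digits. It models only
    the numeric rule discussed above, not the full Unicode rule set."""
    offsets = {m.start() for m in re.finditer(r'\b', text)}
    for m in re.finditer(r'(?<=\d)[.,](?=\d)', text):
        # Drop the boundary just before and just after the separator.
        offsets -= {m.start(), m.end()}
    return sorted(offsets)

t = 'A number: 3.4 :)'
print(boundaries_with_numeric_rule(t))  # [0, 1, 2, 8, 10, 13]
```

With the boundary at offset 11 gone, a pattern that needs a boundary right after the 3 has nowhere to match, which is consistent with the None result above.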

See all the Unicode word boundary rules here: https://unicode.org/reports/tr29/#Word_Boundary_Rules

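Conclusion:

Both variants work with Unicode or your LOCALE. Without WORD, \b is just the empty string between \w and \W, because "a word is defined as a sequence of word characters [\w]". The WORD flag adds a further set of rules for deciding where word boundaries fall.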

python

regex

unicode

word-boundary

python-regex