Regex - match a character and all its diacritic variations (aka accent-insensitive)

Question

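I am trying to match a character and all its possible diacritic variations (aka accent-insensitive) with a regular expression. What I could do, of course, is: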

re.match(r"^[eēéěèȅêęëėẹẽĕȇȩę̋ḕḗḙḛḝė̄]$", "é")

But that is not a generic solution. If I use a Unicode category like \pL, I can't narrow the match down to a specific base character, in this case e.

Answer 1

A workaround that achieves the desired goal is to first strip all diacritics with unidecode and then match against the plain e:

re.match(r"^e$", unidecode("é"))

Or, in this simplified example:

unidecode("é") == "e"

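Another solution, which does not depend on the unidecode library, preserves Unicode, and gives you more control, is to remove the diacritics manually, as follows: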

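Use unicodedata.normalize() to convert your input string to normal form D (for decomposed), which ensures that composite characters like é are converted to the decomposed form e\u0301 (e + COMBINING ACUTE ACCENT). The examples here use NFKD, the compatibility variant of form D, which additionally replaces compatibility characters such as ligatures with their plain equivalents.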

>>> import unicodedata
>>> text = "Héllô"
>>> text
'Héllô'
>>> normalized = unicodedata.normalize("NFKD", text)
>>> print(ascii(normalized))  # ascii() escapes the combining marks so they are visible
'He\u0301llo\u0302'
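To make the difference between the two decomposed forms concrete, here is a small sketch. The variable names are made up for illustration, and the ligature is only an example of a compatibility character; it is not part of the original answer.

```python
import unicodedata

composed = "\u00e9"  # é as a single precomposed code point
ligature = "\ufb01"  # LATIN SMALL LIGATURE FI, a compatibility character

nfd_e = unicodedata.normalize("NFD", composed)
nfkd_e = unicodedata.normalize("NFKD", composed)
nfd_fi = unicodedata.normalize("NFD", ligature)
nfkd_fi = unicodedata.normalize("NFKD", ligature)

# Both forms split é into the base letter e followed by U+0301.
print(ascii(nfd_e), ascii(nfkd_e))
# Only the compatibility form NFKD expands the ligature into "f" + "i".
print(ascii(nfd_fi), ascii(nfkd_fi))
```

If you want to strip accents but leave ligatures and similar characters untouched, "NFD" is likely the better choice.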

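Then, remove all code points that belong to the category Mark, Nonspacing (Mn for short). These are characters that have no width of their own and only decorate the preceding character. Use unicodedata.category() to determine the category.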

>>> stripped = "".join(c for c in normalized if unicodedata.category(c) != "Mn")
>>> stripped
'Hello'
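As a quick illustration of what the category filter sees, the following sketch inspects the two code points of a decomposed é. The `categories` list is only there for demonstration.

```python
import unicodedata

decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# Print code point, general category and official name of each character.
for ch in decomposed:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# The base letter is a lowercase letter (Ll), the accent a non-spacing mark (Mn).
categories = [unicodedata.category(ch) for ch in decomposed]
```

Only the second code point is in category Mn, so it is the only one the filter drops.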

The result can be used as the source for regex matching, just as in the unidecode example above. Here is the whole thing as a function:

import unicodedata

def remove_diacritics(text):
    """
    Returns a string with all diacritics (aka non-spacing marks) removed.
    For example "Héllô" will become "Hello".
    Useful for comparing strings in an accent-insensitive fashion.
    """
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(c for c in normalized if unicodedata.category(c) != "Mn")
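A minimal sketch of how this could be combined with the regex from the question follows. The helper `matches_accent_insensitive` and the list of test characters are hypothetical and not part of the original answer; the function is repeated so the snippet runs on its own.

```python
import re
import unicodedata

def remove_diacritics(text):
    # Decompose, then drop every non-spacing mark (category Mn).
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(c for c in normalized if unicodedata.category(c) != "Mn")

def matches_accent_insensitive(pattern, text):
    """Hypothetical helper: apply `pattern` to the diacritic-free text."""
    return re.match(pattern, remove_diacritics(text)) is not None

variants = ["e", "é", "ê", "ë", "ē", "ę"]
results = [matches_accent_insensitive(r"^e$", v) for v in variants]
print(results)

# Caveat: letters such as ø have no Unicode decomposition, so they pass
# through unchanged and will not match a plain "o".
stroke = remove_diacritics("ø")
print(stroke)
```

Because the pattern is applied to the stripped copy, match positions refer to that copy and not to the original string; this matters only if the input contained characters that change length during normalization.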

python

regex

diacritics

python-3.x

accent-insensitive