正则表达式匹配所有非字母，不包括变音符号 (python)

Question

我找到了这个优秀的指南：http://www.regular-expressions.info/unicode.html#category，它给出了一些关于如何将 non 字母与以下正则表达式匹配的提示：

\P{L}

但是这个正则表达式也会考虑非字母 à 编码为 U+0061 U+0300（如果我理解得很好的话）。例如，在 python 中使用 regex 模块，以下代码段：

all_letter_doc = regex.sub(r'\P{L}', ' ', doc)

将在 pur

中转换 purè

指南中提供了如何将所有字母与以下内容匹配：

\p{L}\p{M}*+

实际上我需要否定它，但我不知道如何获得它。

Answer 1

由于您使用的是 Python 2.x，因此您的 r'\P{L}' 是一个字节字符串，而您的输入是 Unicode。您需要将模式设为 Unicode 字符串。见 PyPi regex reference:

If neither the ASCII, LOCALE nor UNICODE flag is specified, it will default to UNICODE if the regex pattern is a Unicode string and ASCII if it’s a bytestring.

因此，您需要使用 ur'\P{L}' 和 u' ' 替换模式。

如果您想匹配除字母和变音符号以外的 1+ 个字符，您将需要 ur'[^\p{L}\p{M}]+' 正则表达式。

正则表达式匹配所有非字母，不包括变音符号 (python)

Regex match all non letters excluding diacritics (python)

python

regex

unicode

regex-negation

python-2.7