Python 2 vs Python 3 regex matching behavior

Question

Python 3

import re

P = re.compile(r'[\s\t]+') 
re.sub(P, ' ', '\xa0 haha')
' haha'

Python 2

import re

P = re.compile(r'[\s\t]+')
re.sub(P, u' ', u'\xa0 haha')
u'\xa0 haha'

I want the Python 3 behavior, but in Python 2 code. Why does the regex pattern fail to match space-like code points such as \xa0 in Python 2, while it matches them correctly in Python 3?

Answer 1

Use the re.UNICODE flag:

>>> import re
>>> P = re.compile(r'[\s\t]+', flags=re.UNICODE)
>>> re.sub(P, u' ', u'\xa0 haha')
u' haha'

Without the flag, \s matches only ASCII whitespace; \xa0 is not part of the ASCII standard (it is a Latin-1 code point).

In Python 3, re.UNICODE is the default for str patterns; if you want the Python 2 (bytestring) behavior there, use re.ASCII.
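To see the two modes side by side, here is a minimal sketch that assumes Python 3 and str input. The names text, unicode_ws and ascii_ws are only illustrative:

```python
import re

text = '\xa0 haha'

# Default for str patterns in Python 3: \s is Unicode-aware,
# so the no-break space U+00A0 counts as whitespace.
unicode_ws = re.compile(r'\s+')

# re.ASCII restricts \s to [ \t\n\r\f\v], which mirrors what
# Python 2 does when re.UNICODE is not given.
ascii_ws = re.compile(r'\s+', flags=re.ASCII)

unicode_result = unicode_ws.sub(' ', text)
ascii_result = ascii_ws.sub(' ', text)

print(repr(unicode_result))  # ' haha'
print(repr(ascii_result))    # '\xa0 haha'
```

With re.ASCII only the plain space is replaced (by itself), so the \xa0 survives, which is the same result the question shows for Python 2.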

Note that there is no point in including \t in the character class; \t is already part of the \s class, so the following matches exactly the same input:

P = re.compile(r'\s+', flags=re.UNICODE)
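A quick way to convince yourself that the two patterns behave the same is to run both over a few inputs. This is a rough check, assuming Python 3; the sample strings are made up for illustration and are not an exhaustive proof:

```python
import re

with_tab = re.compile(r'[\s\t]+', flags=re.UNICODE)
without_tab = re.compile(r'\s+', flags=re.UNICODE)

# Hypothetical sample inputs mixing ASCII and non-ASCII whitespace.
samples = ['\xa0 haha', 'a\tb', 'a \t\n b', 'no-whitespace', '\u2003wide']

# Pair up the substitution result of each pattern for every sample.
results = [(with_tab.sub(' ', s), without_tab.sub(' ', s)) for s in samples]

for with_result, without_result in results:
    assert with_result == without_result
```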
