Python's re.sub returns data in wrong encoding from unicode

Question

>>> re.sub('\w', '\1', 'абвгдеёжз')
'\x01\x01\x01\x01\x01\x01\x01\x01\x01'

Why does re.sub return data in this format? In this case I want it to return the unchanged string 'абвгдеёжз'. Changing the string to u'абвгдеёжз' or passing flags=re.U doesn't do anything.

Answer 1

Because '\1' is the character with code point 1 (its repr form is '\x01'). re.sub never saw your backslash, per the rules on string literals. Even if you did escape it, such as in r'\1' or '\\1', reference 1 isn't the right number; you need parentheses to define groups. r'\g<0>' would work, as described in the re.sub documentation.
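
A minimal sketch of the points in this answer, assuming Python 3 (where str patterns match Unicode word characters by default). The variable names are illustrative only and are not from the original post.

```python
import re

text = 'абвгдеёжз'

# In a normal string literal, '\1' is an octal escape: a single
# character with code point 1, so re.sub never sees a backslash.
assert '\1' == '\x01' and len('\1') == 1

# A raw string keeps the backslash, giving re.sub a real backreference.
assert len(r'\1') == 2

# \g<0> refers to the whole match, so no capturing group is needed.
whole_match = re.sub(r'\w', r'\g<0>', text)

# With a capturing group in the pattern, \1 refers to group 1.
group_one = re.sub(r'(\w)', r'\1', text)

# Without any group in the pattern, r'\1' is an invalid reference
# and re.sub raises re.error.
try:
    re.sub(r'\w', r'\1', text)
    missing_group_error = None
except re.error as exc:
    missing_group_error = exc

print(whole_match, group_one, missing_group_error)
```

Both successful calls return the input unchanged, because each word character is replaced by itself.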

Answer 2

Perhaps you meant:

>>> re.sub('(\w)', r'\1', 'абвгдеёжз')
'абвгдеёжз'

python

regex

replace

python-3.x

python-unicode