Issue in encode/decode in Python 3 with non-ASCII character

Question

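I am trying to use Python 3's unicode_escape codec to turn the escaped \\n in my string into a real line break. The challenge is that the string also contains non-ASCII characters: if I encode it with utf8 and then decode the bytes with unicode_escape, the special character gets garbled. Is there any way to have the \\n replaced with a new line without garbling the special character?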

s = "hello\\nworld└--"
print(s.encode('utf8').decode('unicode_escape'))

Expected Result:
hello
world└--

Actual Result:
hello
worldâ--

Answer 1

Try removing the second escape backslash and decoding with utf8:

>>> s = "hello\nworld└--"
>>> print(s.encode('utf8').decode('utf8'))
hello
world└--

Answer 2

I think the issue you're having is that unicode_escape seems to assume your input is 'latin-1', because that is the original codec used inside the unicode_escape function...

Looking at the Python documentation for codecs, we see: "Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decodes from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default." This tells us that unicode_escape assumes your text is ISO Latin-1. So if we run your code with latin1 encoding, we get this error:

s.encode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2514' in position 12: ordinal not in range(256)

The character named in the error is '\u2514', which is '└'. Put simply, that character cannot be represented in a Latin-1 string, so you end up with a different character.
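
As a small sketch of the point above (the variable names are only illustrative), the UTF-8 bytes of '└' can be decoded both ways to show that unicode_escape treats non-escape bytes the same way a Latin-1 decode does:

```python
# '└' is U+2514; its UTF-8 encoding is three bytes.
raw = "└".encode("utf-8")
print(raw)  # b'\xe2\x94\x94'

# unicode_escape maps each non-escape byte to the code point of the
# same value, which is exactly what a Latin-1 decode does.
via_escape = raw.decode("unicode_escape")
via_latin1 = raw.decode("latin-1")
print(via_escape == via_latin1)  # True

# One 'â' followed by two invisible control characters: the mojibake
# seen in the question.
print(ascii(via_escape))  # '\xe2\x94\x94'
```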

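I also think it's worth pointing out that your string contains '\\n' and not just '\n'. The extra backslash means this is not a line break: the first backslash escapes the second, so what you get is a literal backslash followed by n. Perhaps try not using the \\n...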
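
To make the difference concrete, here is a short comparison of the two literals (the names `escaped` and `real` are just for illustration):

```python
escaped = "hello\\nworld"  # backslash + 'n': two characters, no line break
real = "hello\nworld"      # a single newline character

print(len(escaped), len(real))  # 12 11
print(escaped)                  # hello\nworld
print(real.splitlines())        # ['hello', 'world']
```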

Answer 3

As user wowcha observes, the unicode-escape codec assumes a latin-1 encoding, but your string contains a character that cannot be encoded as latin-1.

>>> s = "hello\\nworld└--"
>>> s.encode('latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2514' in position 12: ordinal not in range(256)

Encoding the string as utf-8 gets around the encoding problem, but results in mojibake when decoding from unicode-escape.

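The solution is to use the backslashreplace error handler when encoding. This converts the problem character into an escape sequence that can be encoded as latin-1 and that does not get mangled when decoded from unicode-escape.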

>>> s.encode('latin-1', errors='backslashreplace')
b'hello\\nworld\\u2514--'

>>> s.encode('latin-1', errors='backslashreplace').decode('unicode-escape')
'hello\nworld└--'

>>> print(s.encode('latin-1', errors='backslashreplace').decode('unicode-escape'))
hello
world└--
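
If this is needed in more than one place, the two steps could be wrapped in a small function. The following is only a sketch: `unescape` is a made-up helper name, not something provided by the standard library, and it has only been reasoned through for input like the string in the question.

```python
def unescape(text: str) -> str:
    """Interpret backslash escapes in text, keeping non-ASCII characters intact.

    Hypothetical helper; the name is illustrative only.
    """
    # Characters outside Latin-1 become \uXXXX or \UXXXXXXXX escapes,
    # which unicode_escape then turns back into the original characters.
    as_bytes = text.encode("latin-1", errors="backslashreplace")
    return as_bytes.decode("unicode_escape")


s = "hello\\nworld└--"
result = unescape(s)
print(result)
```

Characters that already fit in Latin-1, such as 'é', pass through as single bytes and come back unchanged.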
