Python decode on a unicode variable with or without non-ASCII characters

Question

A simple example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import traceback

e_u = u'abc'
c_u = u'中国'

print sys.getdefaultencoding()
try:
    print e_u.decode('utf-8')
    print c_u.decode('utf-8')
except Exception as e:
    print traceback.format_exc()

reload(sys)
sys.setdefaultencoding('utf-8')
print sys.getdefaultencoding()
try:
    print e_u.decode('utf-8')
    print c_u.decode('utf-8')
except Exception as e:
    print traceback.format_exc()

Output:

ascii
abc
Traceback (most recent call last):
  File "test_codec.py", line 15, in <module>
    print c_u.decode('utf-8')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

utf-8
abc
中国

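While trying to understand codecs in Python thoroughly, a few questions have been bothering me for days, and I want to confirm that my understanding is correct:

With the default encoding set to ascii, u'abc'.decode('utf-8') raises no error, but u'中国'.decode('utf-8') does.

My guess is that when u'中国'.decode('utf-8') runs, Python sees that u'中国' is already unicode, so it first tries u'中国'.encode(sys.getdefaultencoding()). That step is what fails, which is why the exception is a UnicodeEncodeError and not a decode error.

u'abc', on the other hand, has the same code points as 'abc' (all < 128), so there is no error.

In Python 2.x, how does Python store a variable's value internally? If all characters in a string are < 128, is it treated as ascii, and if any are > 128, as utf-8?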

In [4]: chardet.detect('abc')
Out[4]: {'confidence': 1.0, 'encoding': 'ascii'}

In [5]: chardet.detect('abc中国')
Out[5]: {'confidence': 0.7525, 'encoding': 'utf-8'}

In [6]: chardet.detect('中国')
Out[6]: {'confidence': 0.7525, 'encoding': 'utf-8'}

Answer 1

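Short answer

Either use encode(), or leave the call out entirely. Don't use decode() on unicode strings; that makes no sense. Also, sys.getdefaultencoding() doesn't help here in any way.

Long answer, part 1: How to do it correctly?

If you define: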

c_u = u'中国'

then c_u is already a unicode string. That is, the Python interpreter has already decoded it from a byte string (your source file) into a unicode string, using your -*- coding: utf-8 -*- declaration.
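As a rough illustration of what that decoding step amounts to, the sketch below starts from the UTF-8 bytes a source file would contain for 中国 and decodes them by hand. The names source_bytes, decoded and code_points are made up for this example, and the snippet is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only: the variable names are made up for this example.

# The six bytes that a file saved as UTF-8 contains for the two characters 中国.
source_bytes = b'\xe4\xb8\xad\xe5\x9b\xbd'

# Roughly what the interpreter does for the literal u'中国'
# when the file declares "coding: utf-8".
decoded = source_bytes.decode('utf-8')

# A unicode string is a sequence of code points, not a sequence of bytes.
code_points = [ord(ch) for ch in decoded]
print(code_points)          # [20013, 22269]
print(decoded == u'中国')   # True
```

Six bytes go in and two code points come out, which is why a unicode string has no encoding of its own left to decode.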

If you execute:

print c_u.encode('utf-8')

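your string is encoded back to UTF-8, and that byte string is sent to standard output. Note that the encoding step usually happens automatically when you print a unicode string (print takes the encoding from sys.stdout.encoding when one is set), so you can simplify this to: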

print c_u
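For comparison, here is a minimal sketch of the explicit direction, from text to bytes. It assumes UTF-8 is the encoding you want on output; the name encoded is made up for this example. The repr of a byte string is displayed differently on Python 2 and Python 3, so no expected output is shown for it.

```python
# -*- coding: utf-8 -*-
import sys

c_u = u'中国'

# Explicit conversion from text (code points) to bytes.
encoded = c_u.encode('utf-8')
print(repr(encoded))

# Decoding those bytes with the same codec gives the original text back.
print(encoded.decode('utf-8') == c_u)   # True

# The encoding that print would pick for a unicode string on this stdout.
# It can be None, for example when output is piped under Python 2.
print(getattr(sys.stdout, 'encoding', None))
```

If the last line prints None or ascii on your system, a plain print c_u can fail under Python 2, and encoding explicitly is the safer choice.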

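Long answer, part 2: What's wrong with c_u.decode()?

If you execute c_u.decode(), Python will

1. try to convert your object (i.e. your unicode string) into a byte string
2. try to decode that byte string into a unicode string

Note that this makes no sense if your object is a unicode string in the first place; you just convert it back and forth. But why does it fail? Well, it's a strange feature of Python 2 that the first step (1.), like any implicit conversion from a unicode string to a byte string, usually uses sys.getdefaultencoding(), which in turn defaults to ASCII. In other words,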

c_u.decode()

roughly translates to:

c_u.encode(sys.getdefaultencoding()).decode()

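which is why it fails: the ASCII codec cannot encode 中国, so the first step raises UnicodeEncodeError before any decoding happens. u'abc' contains only ASCII characters, so both steps succeed.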
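The two steps can be emulated with a small helper. This is only a sketch of the behaviour described above, not how CPython implements it; emulate_py2_unicode_decode and its default_encoding parameter are hypothetical names introduced for this example. Because the steps are spelled out explicitly, the snippet also runs on Python 3, where str has no decode method at all.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper that mimics what Python 2 does for unicode.decode().
# It is not a real API; it only spells out the two steps explicitly.

def emulate_py2_unicode_decode(text, encoding, default_encoding='ascii'):
    # Step 1: implicit unicode -> bytes conversion using the default encoding.
    as_bytes = text.encode(default_encoding)
    # Step 2: the decode that was actually requested.
    return as_bytes.decode(encoding)

# Pure ASCII text survives step 1, so the round trip succeeds.
print(emulate_py2_unicode_decode(u'abc', 'utf-8'))   # abc

# Non-ASCII text fails in step 1, hence an *encode* error from a decode call.
try:
    emulate_py2_unicode_decode(u'中国', 'utf-8')
except UnicodeEncodeError as exc:
    print(type(exc).__name__)   # UnicodeEncodeError

# With utf-8 as the default (the effect of sys.setdefaultencoding('utf-8')
# in the question), step 1 succeeds and the round trip is a no-op.
roundtrip = emulate_py2_unicode_decode(u'中国', 'utf-8', default_encoding='utf-8')
print(roundtrip == u'中国')   # True
```

This matches the output in the question: abc succeeds under both defaults, while 中国 only succeeds once the default encoding is utf-8.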

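Note that while you may be tempted to change the default encoding, don't forget that third-party libraries might contain similar issues and might break if the default encoding is anything other than ASCII.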

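Having said that, I strongly believe Python would have been better off if unicode.decode() had never been defined in the first place. Unicode strings are already decoded; there is no point in decoding them once more, especially in the way Python does it.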
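One possible way to avoid both the stray decode() call and the sys.setdefaultencoding() workaround is to decode only values that are actually byte strings. The helper below is a sketch under that assumption; to_text is a hypothetical name, and UTF-8 is assumed as the input encoding.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper: decode only when the value really is a byte string.

def to_text(value, encoding='utf-8'):
    # On Python 2, `bytes` is an alias for `str`, so this check
    # works on both Python 2.7 and Python 3.
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already text: return it unchanged rather than decoding it again.
    return value

print(to_text(b'\xe4\xb8\xad\xe5\x9b\xbd') == u'中国')   # True
print(to_text(u'中国') == u'中国')                        # True
```

Text passed in is returned untouched, so the implicit ASCII encode step never runs.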
