UnicodeDecodeError when concatenating strings

Question

I have the following Python 2.7 script:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import geoip2.database

def ret_country_iso(ip):
    reader = geoip2.database.Reader('/usr/local/geoip/GeoLite2-Country.mmdb')
    response = reader.country(ip)
    return response.country.iso_code.lower()

result = ret_country_iso("8.8.8.8")
print result
result += "Роман"
print result

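As you can see, I first look up the country for the IP "8.8.8.8" (this returns "us", see below) and then concatenate to it a short string containing some Russian characters.

Result: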

# ./script.py
us
Traceback (most recent call last):
   File "./script.py", line 12, in <module>
    result += "Роман"
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

Now, if I try the following instead

#!/usr/bin/python
# -*- coding: utf-8 -*-

result = "us"
print result
result += "Роман"
print result

then everything works fine:

./script.py 
us
usРоман

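Obviously, the 'ret_country_iso()' function returns something different from the literal "us" string, but my Python is too poor to work out what.

How can I fix the above?

EDIT: as suggested by snakecharmerb, the following works: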

#!/usr/bin/python
# -*- coding: utf-8 -*-

import geoip2.database

def ret_country_iso(ip):
    reader = geoip2.database.Reader('/usr/local/geoip/GeoLite2-Country.mmdb')
    response = reader.country(ip)
    return response.country.iso_code.lower().encode('utf-8')

result = ret_country_iso("8.8.8.8")
print result
result += "Роман"
print result

Answer 1

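Python 2 does not strictly separate unicode from bytes (str): when the two are mixed, it implicitly decodes the str with the ASCII codec, so the outcome of a concatenation depends on the bytes involved: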

u'abc' + 'def'

succeeds, but

u'US' + 'Роман'

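raises an exception. The usual approach, the "Unicode sandwich" pattern, is to decode and encode string data at the edges of the application and to use only unicode inside it (applications that deal mainly with bytes use the reverse pattern).

So, when combining str and unicode instances, you can choose either of the following: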
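To make the difference between the two concatenations concrete, the sketch below performs by hand the ASCII decode that Python 2 applies implicitly when a str meets a unicode object. The variable names are made up for illustration, and the snippet is written so that it should run under both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Explicit version of the conversion Python 2 performs implicitly when a
# str (bytes) is concatenated with a unicode object.

ascii_bytes = b'def'
utf8_bytes = u'Роман'.encode('utf-8')  # the bytes a UTF-8 source file holds for 'Роман'

# Pure-ASCII bytes decode cleanly, which is why u'abc' + 'def' succeeds.
combined = u'abc' + ascii_bytes.decode('ascii')

# Bytes >= 0x80 cannot be decoded as ASCII, which is why u'US' + 'Роман' fails.
try:
    utf8_bytes.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(combined)
print(error_message)
```

The message captured in error_message names the same byte (0xd0) and position (0) as the traceback in the question.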

# unicode result
u'US ' + 'Роман'.decode('utf-8')

# str result
u'US '.encode('utf-8') + 'Роман'

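But the key is to be consistent throughout your code, otherwise you will end up with a lot of bugs.

Python 3 distinguishes between the two types more strictly; if possible, you should consider using it for better unicode handling, since Python 2 is no longer supported.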
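One way to stay consistent is to apply the "Unicode sandwich" literally: decode at the input edge, work only with text, and encode at the output edge. The following is a minimal sketch of that idea, not a definitive implementation; decode_input and encode_output are hypothetical helper names, UTF-8 is an assumed encoding, and the code is meant to run under both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Unicode sandwich: bytes at the edges, text in the middle.

def decode_input(raw, encoding='utf-8'):
    """Input edge: turn incoming bytes into text; pass text through unchanged."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw

def encode_output(text, encoding='utf-8'):
    """Output edge: turn text back into bytes just before it leaves the program."""
    return text.encode(encoding)

country = u'us'                                # text, standing in for the lookup result
name = decode_input(u'Роман'.encode('utf-8'))  # bytes arriving from outside
label = country + name                         # text + text: no implicit decode involved
payload = encode_output(label)                 # bytes, ready for a file or socket
```

Because every value between the two helpers is text, no concatenation in the middle of the program can trigger the implicit ASCII decode.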
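As a small illustration of that stricter separation, the snippet below (Python 3 only, with illustrative variable names) shows that mixing bytes and str is rejected outright rather than resolved through an implicit ASCII decode, so the conversion has to be written explicitly.

```python
# Python 3: mixing bytes and str raises TypeError instead of guessing an encoding.
country = b'us'
name = 'Роман'

try:
    country + name
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The conversion has to be spelled out before the two can be joined.
result = country.decode('utf-8') + name

print(mixed_ok)
print(result)
```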

python

character-encoding

python-2.7

python-unicode