How do I convert a unicode string with cp1252 characters into UTF-8 with Python?
I am getting text through an API that returns characters with a Windows-encoded apostrophe (\x92):
> python
>>> title = u'There\x92s thirty days in June'
>>> title
u'There\x92s thirty days in June'
>>> print title
Theres thirty days in June
>>> type(title)
<type 'unicode'>
I am trying to convert this string to UTF-8 so that it returns: "There’s thirty days in June"
When I try to decode or encode this unicode string, it throws an error:
>>> title.decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 5: ordinal not in range(128)
>>> title.encode("cp1252").decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\x92' in position 5: character maps to <undefined>
If I initialize the string as a plain byte string and then decode it, it works:
>>> title = 'There\x92s thirty days in June'
>>> type(title)
<type 'str'>
>>> print title.decode('cp1252')
There’s thirty days in June
My question is: how do I convert the unicode string I'm getting into a plain byte string so that I can decode it?
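Your string appears to have been decoded with latin1 (as it is of type unicode).
1. To convert it back to the bytes it originally was, you need to encode it using that encoding (latin1).
2. Then, to get text (unicode) back, you must decode using the proper codec (cp1252).
3. Finally, if you want UTF-8 bytes, you must encode using the UTF-8 codec.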
In code:
>>> title = u'There\x92s thirty days in June'
>>> title.encode('latin1')
'There\x92s thirty days in June'
>>> title.encode('latin1').decode('cp1252')
u'There\u2019s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252'))
There’s thirty days in June
>>> title.encode('latin1').decode('cp1252').encode('UTF-8')
'There\xe2\x80\x99s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252').encode('UTF-8'))
There’s thirty days in June
Depending on whether the API takes text (unicode) or bytes, step 3 may not be necessary.
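For readers on Python 3, where every str is already unicode, the same round trip can be sketched as a small helper. The name repair_cp1252 is hypothetical, and the sketch assumes the text really was cp1252 bytes mis-decoded as latin1. Text containing characters above U+00FF, or bytes that cp1252 leaves undefined (such as 0x81), would raise an error instead.

```python
def repair_cp1252(text):
    """Undo a latin1 mis-decoding of cp1252 bytes (hypothetical helper)."""
    # Step 1: latin1 maps code points 0-255 one-to-one onto bytes,
    # so encoding with it recovers the original raw bytes.
    raw = text.encode("latin-1")
    # Step 2: decode those bytes with the codec they were written in.
    return raw.decode("cp1252")


title = "There\x92s thirty days in June"
fixed = repair_cp1252(title)

# Step 3, only needed if the consumer expects bytes rather than text.
utf8_bytes = fixed.encode("utf-8")

print(ascii(fixed))  # 'There\u2019s thirty days in June'
print(utf8_bytes)    # b'There\xe2\x80\x99s thirty days in June'
```

Keeping steps 1 and 2 together in one function and leaving the UTF-8 encode to the caller mirrors the note above: the final encode is only required when bytes are needed.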