UnicodeDecodeError: 'ascii' codec can't decode byte (microsoft API)

Question

I'm trying to build some text that contains special characters (my code is written in Python 2.7), and I keep getting the same ASCII error. This is what I have so far:

The first line of the file:

# -*- coding: utf-8 -*-

Then, inside the function, I specify the reference text to send:

self.referenceText=u"波构".encode('utf-8')
self.pronAssessmentParamsJson = "{\"ReferenceText\":\"%s\",\"GradingSystem\":\"FivePoint\"}" % self.referenceText;

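Unfortunately, the program fails when it reaches the second line, that is, when the %s placeholder is replaced with the text containing the special characters. The error message is: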

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

(These steps are taken from the official sample: https://github.com/Azure-Samples/Cognitive-Speech-TTS/blob/master/PronunciationAssessment/Python/sample.py)

Thanks for your help.

Answer 1

I can't reproduce your problem with the following (simplified but runnable) snippet:

# -*- coding: utf-8 -*-
referenceText = u"波构".encode('utf-8')
pronAssessmentParamsJson = '{"ReferenceText":"%s"}' % referenceText

The above runs without exceptions in Python 2.7.17.

However, I can reproduce the UnicodeDecodeError with the following modified version (note the u prefix before the second string literal):

# -*- coding: utf-8 -*-
referenceText = u"波构".encode('utf-8')
pronAssessmentParamsJson = u'{"ReferenceText":"%s"}' % referenceText

Or with this one:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
referenceText = u"波构".encode('utf-8')
pronAssessmentParamsJson = '{"ReferenceText":"%s"}' % referenceText

The effect of the unicode_literals directive is that all string literals are treated as if they had the u prefix.

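The problem here is implicit coercion. First, you explicitly encode u"波构" from type unicode to type str using UTF-8. But then the string formatting with % coerces it back to unicode, because if one of the operands is unicode, the other has to be unicode as well. The literal u'{"ReferenceText":"%s"}' is unicode, so Python tries to convert the value of referenceText from str to unicode automatically.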

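The automatic conversion uses Python 2's default encoding, which is ASCII (see sys.getdefaultencoding()), so behind the scenes it amounts to .decode('ascii') rather than .decode('utf8') or any other codec. That fails, of course: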

>>> u"波构".encode('utf-8')
'\xe6\xb3\xa2\xe6\x9e\x84'
>>> u"波构".encode('utf-8').decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

One solution is to postpone the manual encoding to a later stage, so that the implicit coercion is avoided:

# -*- coding: utf-8 -*-
referenceText = u"波构"
pronAssessmentParamsJson = u'{"ReferenceText":"%s"}' % referenceText
pronAssessmentParamsJson = pronAssessmentParamsJson.encode('utf-8')
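
If the reference text reaches you already encoded (as in the question), the same idea can be applied by decoding it explicitly before formatting. The following is only a minimal sketch: to_text is a hypothetical helper of mine, not part of the Azure sample, and it assumes the bytes are UTF-8.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper (not part of the Azure sample): decode byte strings
# explicitly so that "%" formatting never has to fall back to ASCII.


def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding it first if it is bytes."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value


reference_bytes = u"波构".encode("utf-8")  # already-encoded input
template = u'{"ReferenceText":"%s"}'

# Both operands are text here, so no implicit coercion takes place.
payload_text = template % to_text(reference_bytes)

# Encode exactly once, at the point where bytes are actually needed.
payload_bytes = payload_text.encode("utf-8")
```

The point of the design is "decode early, encode late": everything in between works on unicode objects only, so Python never has to guess a codec.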

However, since you are apparently trying to serialize JSON, you should really do this instead:

>>> import json
>>> json.dumps({'ReferenceText': u"波构"})
'{"ReferenceText": "\\u6ce2\\u6784"}'

Otherwise, you will quickly run into trouble if referenceText contains, for example, quotes or newlines.
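
To illustrate that last point, here is a small sketch comparing the two approaches. The reference text with a quote and a newline is made up for the example; the GradingSystem value is the one from the question.

```python
# -*- coding: utf-8 -*-
import json

# Hypothetical reference text containing a double quote and a newline.
reference_text = u'波构 "quoted"\nsecond line'

# Naive "%" formatting leaves the quote and the newline unescaped,
# so the result is not valid JSON.
naive = u'{"ReferenceText":"%s"}' % reference_text
try:
    json.loads(naive)
    naive_is_valid = True
except ValueError:  # json.JSONDecodeError subclasses ValueError on Python 3
    naive_is_valid = False

# json.dumps escapes the special characters and the non-ASCII text.
params = {"ReferenceText": reference_text, "GradingSystem": "FivePoint"}
safe = json.dumps(params)

# The serialized string parses back to the original text.
round_tripped = json.loads(safe)
```

With the default ensure_ascii=True, the serialized string contains only ASCII characters, which also sidesteps the encoding problem from the question.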
