Python can't encode with surrogateescape
I'm having a problem with Unicode surrogate encoding in Python (3.4):
>>> b'\xCC'.decode('utf-16_be', 'surrogateescape').encode('utf-16_be', 'surrogateescape')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-be' codec can't encode character '\udccc' in position 0: surrogates not allowed
If I'm not mistaken, according to the Python documentation:
'surrogateescape': On decoding, replace byte with individual surrogate
code ranging from U+DC80 to U+DCFF. This code will then be turned back
into the same byte when the 'surrogateescape' error handler is used
when encoding the data.
the code should just reproduce the source sequence (b'\xCC'). So why is the exception raised?
This is possibly related to my second question:
Changed in version 3.4: The utf-16* and utf-32* encoders no longer allow surrogate code points (U+D800–U+DFFF) to be encoded.
(from https://docs.python.org/3/library/codecs.html#standard-encodings)
As far as I know, some code points cannot be encoded in UTF-16 without surrogate pairs. So what is the reason behind this change?
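This change was made because the Unicode standard explicitly disallows such encodings. See issue #12892, but apparently the surrogateescape error handler cannot be made to work with UTF-16 or UTF-32, because these codecs are not ASCII-compatible.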
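To make the distinction behind that change concrete, here is a small sketch (the variable names are only illustrative). It assumes Python 3.4 or later. A code point outside the BMP is still written as a surrogate pair of 16-bit code units; what the encoder refuses is a lone surrogate code point sitting in the str itself.

```python
# A code point outside the BMP is encoded as a surrogate *pair* of code units.
emoji = '\U0001F600'
pair_bytes = emoji.encode('utf-16-be')
print(pair_bytes)  # two 16-bit code units, D83D followed by DE00

# A lone surrogate *code point* in the str is what the 3.4+ encoder rejects.
try:
    '\ud83d'.encode('utf-16-be')
    lone_rejected = False
except UnicodeEncodeError as exc:
    lone_rejected = True
    print(exc.reason)
```

So surrogate pairs in the encoded bytes are untouched by the change; only strings that already contain surrogate code points are affected.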
Specifically, from that issue:
I tested utf_16_32_surrogates_4.patch: surrogateescape with as encoder
does not work as expected.
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'ignore')
'[]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'replace')
'[�]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'surrogateescape')
'[\udc80\udcdc\uffff'
=> I expected '[\udc80\udcdc]'.
to which the response was:
Yes, surrogateescape doesn't work with ASCII incompatible encodings and can't. First, it can't represent the result of decoding b'\x00\xd8' from utf-16-le or b'ABCD' from utf-32*. This problem is worth separated issue (or even PEP) and discussion on Python-Dev.
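I believe the surrogateescape handler is meant more for UTF-8 data. That decoding from UTF-16 or UTF-32 now works with it too is a nice extra, but apparently it can't work in the other direction.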
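As a rough illustration of that asymmetry, the sketch below round-trips an invalid byte through UTF-8 with surrogateescape and then attempts the same through UTF-16. The helper try_utf16_round_trip is a hypothetical name introduced only for this example. On the 3.4 interpreter from the question the UTF-16 attempt fails; other versions may differ, so the helper reports the outcome rather than assuming it.

```python
raw = b'abc\xcc'  # 0xCC is not valid UTF-8 on its own

# With UTF-8 the escaped byte survives a decode/encode round trip.
text = raw.decode('utf-8', 'surrogateescape')
round_tripped = text.encode('utf-8', 'surrogateescape')
print(ascii(text), round_tripped == raw)


def try_utf16_round_trip(data):
    """Return the re-encoded bytes, or None if the encoder refuses."""
    s = data.decode('utf-16-be', 'surrogateescape')
    try:
        return s.encode('utf-16-be', 'surrogateescape')
    except UnicodeEncodeError:
        return None


# The single odd byte from the question: decoding works, encoding may not.
utf16_result = try_utf16_round_trip(b'\xcc')
print(utf16_result)
```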
If you use surrogatepass (instead of surrogateescape), things should work on Python 3.
See: https://docs.python.org/3/library/codecs.html#codec-base-classes (which says that surrogatepass allows encoding and decoding of surrogate codes, for utf related encodings).
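A minimal sketch of what that looks like, assuming Python 3.4 or later (where surrogatepass is supported by the utf-16* and utf-32* codecs). Note that surrogatepass writes the surrogate code point itself as a 16-bit code unit, so the result is two bytes rather than the original single byte b'\xCC' from the question.

```python
lone = '\udccc'  # the surrogate produced by the decode step in the question

# surrogatepass lets the lone surrogate through as an ordinary code unit.
encoded = lone.encode('utf-16-be', 'surrogatepass')
print(encoded)  # the code unit DCCC, i.e. two bytes

# Decoding with the same handler gives the surrogate back.
decoded = encoded.decode('utf-16-be', 'surrogatepass')
print(decoded == lone)
```

This makes the string encodable, but it is not a byte-for-byte round trip of the original malformed input.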