Current idiom for removing 'surrogateescape' characters from a decoded string

Armin Ronacher, http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/

If you for instance pass [the result of os.fsdecode() or equivalent] to a template engine you [sometimes get a UnicodeEncodeError] somewhere else entirely and because the encoding happens at a much later stage you no longer know why the string was incorrect. If you detect that error when it happens the issue becomes much easier to debug

Armin suggests a function:

def remove_surrogate_escaping(s, method='ignore'):
    assert method in ('ignore', 'replace'), 'invalid removal method'
    return s.encode('utf-8', method).decode('utf-8')
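
To make the behaviour concrete, here is a small sketch of that function in use. The byte string `raw` is a made-up input (it is not from Armin's post), standing in for a filename that is not valid UTF-8.

```python
def remove_surrogate_escaping(s, method='ignore'):
    assert method in ('ignore', 'replace'), 'invalid removal method'
    return s.encode('utf-8', method).decode('utf-8')

# Hypothetical input: the byte 0xff can never appear in valid UTF-8.
raw = b'caf\xff.txt'
escaped = raw.decode('utf-8', 'surrogateescape')  # 'caf\udcff.txt'

# A strict encode fails; this is the late UnicodeEncodeError described above.
try:
    escaped.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

ignored = remove_surrogate_escaping(escaped)              # escaped byte dropped
replaced = remove_surrogate_escaping(escaped, 'replace')  # escaped byte becomes '?'

print(strict_failed, ignored, replaced)  # True caf.txt caf?.txt
```

Because the error handler runs during encoding, 'replace' substitutes an ASCII question mark here.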

Nick Coghlan, 2014, [Python-Dev] Cleaning up surrogate escaped strings

The current proposal on the issue tracker is to ... take advantage of the existing error handlers:

def convert_surrogateescape(data, errors='replace'):
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

That code is short, but semantically dense - it took a few iterations to come up with that version. (Added bonus: once you're alerted to the possibility, it's trivial to write your own version for existing Python 3 versions. The standard name just makes it easier to look up when you come across it in a piece of code, and provides the option of optimising it later if it ever seems worth the extra work)
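
A minimal sketch of how the proposed function behaves. The input bytes are an assumption chosen for illustration, not something taken from the mailing list post.

```python
def convert_surrogateescape(data, errors='replace'):
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

# Hypothetical input: bytes that are not valid UTF-8, decoded with surrogateescape.
raw = b'caf\xff.txt'
escaped = raw.decode('utf-8', 'surrogateescape')  # 'caf\udcff.txt'

# Step 1 of the function: encoding with surrogateescape restores the original bytes.
round_trip = escaped.encode('utf-8', 'surrogateescape')

# Step 2: the chosen handler is applied while decoding those bytes again.
replaced = convert_surrogateescape(escaped)           # bad byte becomes U+FFFD
ignored = convert_surrogateescape(escaped, 'ignore')  # bad byte dropped
```

Here the error handler runs during decoding, so 'replace' yields U+FFFD (the replacement character) rather than '?'.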

The functions are slightly different. The second was written with knowledge of the first.

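Since Python 3.5, the backslashreplace error handler works for decoding as well as encoding. The first approach is not designed to use backslashreplace: for example, an error decoding the byte 0xff would be printed as "\udcff". The second approach is designed to fix this; it prints "\xff".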
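
The difference can be checked with a short sketch (Python 3.5 or later). Since the first function's assert rejects 'backslashreplace', its encode/decode pair is written out inline here; that inlining is my own illustration, not part of either quoted function.

```python
# A single undecodable byte, decoded with surrogateescape.
escaped = b'\xff'.decode('utf-8', 'surrogateescape')  # '\udcff'

# First approach: the handler runs while ENCODING the lone surrogate,
# so the output names the surrogate code point.
first = escaped.encode('utf-8', 'backslashreplace').decode('utf-8')

# Second approach: restore the original byte, then let the handler run
# while DECODING, so the output names the original byte.
second = escaped.encode('utf-8', 'surrogateescape').decode('utf-8', 'backslashreplace')

print(first)   # \udcff
print(second)  # \xff
```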

If you don't need backslashreplace, you might prefer the first version if you have the misfortune of supporting Python < 3.5 (including polyglot 2/3 code, ouch).

Question

Is there a better idiom by now? Or do we still use this drop-in function?

Nick referred to an issue for adding such a function to the codecs module. As of 2019, the function has not been added and the ticket remains open.


The latest comment says:

msg314682 Nick Coghlan, 2018

A recent discussion on python-ideas also introduced me to the third party library, "ftfy", which offers a wide range of tools for cleaning up improperly decoded data.

That includes a lone surrogate fixer: ftfy.fixes.fix_surrogates(text)

...

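I don't find the function in ftfy appealing. The documentation doesn't say so, but it appears to be designed both to handle surrogateescape and ... to be part of a workaround for CESU-8 or similar?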

Replace 16-bit surrogate codepoints with the characters they represent (when properly paired), or with � otherwise.