How to detect if a string is already utf8-encoded?

Question

I have some strings like this:

u'ThaÃÂ¯lande'

This is supposed to be "Thaïlande". I don't know how it ended up encoded like this, but I need to turn it back into "Thaïlande" and then URL-encode it.

Is there a way, in Python 2, to guess whether a string has already been encoded?

Answer 1

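What you have is called Mojibake. You could use statistical analysis to check whether the text contains an unusual number of Latin-1 characters in combinations that are typical of UTF-8 byte sequences, or whether it contains any CP1252-specific characters.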
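For simple cases you can also test for this by hand, without any statistics. The sketch below assumes the damage came from UTF-8 bytes being decoded as Latin-1 (pass "cp1252" instead if you suspect Windows-1252). The names `fix_mojibake` and `looks_like_mojibake` are made up for illustration. The idea is to re-encode the text with the wrong codec to recover the raw bytes, then check whether those bytes are valid UTF-8; correctly decoded text such as "Thaïlande" almost never survives that round trip, so a failure means "leave it alone".

```python
from __future__ import unicode_literals  # so the same literals work on Python 2 and 3


def fix_mojibake(text, wrong_codec="latin-1", max_rounds=3):
    """Undo UTF-8-read-as-`wrong_codec` mojibake; return text unchanged if none is found."""
    for _ in range(max_rounds):  # text may have been mangled more than once
        try:
            # Recover the original bytes, then decode them the way they were meant to be.
            candidate = text.encode(wrong_codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # not representable in wrong_codec, or not valid UTF-8: stop here
        if candidate == text:
            break  # pure ASCII round-trips unchanged, nothing left to fix
        text = candidate
    return text


def looks_like_mojibake(text, wrong_codec="latin-1"):
    """Heuristic: the text counts as mojibake if the round trip changes it."""
    return fix_mojibake(text, wrong_codec) != text


# "Tha\u00c3\u00aflande" is "Thaïlande" after one round of UTF-8-as-Latin-1 damage.
fixed = fix_mojibake("Tha\u00c3\u00aflande")
```

This is only a heuristic: a string that legitimately contains a sequence such as "Ã©" would be "repaired" by mistake, and text that lost characters along the way cannot be recovered like this.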

There is already a package that does this for you and repairs the damage if Mojibake is detected: ftfy:

The goal of ftfy is to take in bad Unicode and output good Unicode, for use in your Unicode-aware code.

and

The ftfy.fix_encoding() function will look for evidence of mojibake and, when possible, it will undo the process that produced it to get back the text that was supposed to be there.

Does this sound impossible? It’s really not. UTF-8 is a well-designed encoding that makes it obvious when it’s being misused, and a string of mojibake usually contains all the information we need to recover the original string.
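Putting it together for the question, here is a rough sketch of repairing the string and then URL-encoding it. `url_quote_utf8` and `repair_and_quote` are hypothetical helper names. ftfy is a third-party package, so it is imported lazily inside the one function that needs it. As far as I know, recent ftfy releases are Python 3 only, so on Python 2 you would need an older release.

```python
from __future__ import unicode_literals

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2


def url_quote_utf8(text):
    """Percent-encode a unicode string as UTF-8 bytes."""
    return quote(text.encode("utf-8"))


def repair_and_quote(text):
    """Fix mojibake with ftfy (assumed to be installed), then URL-encode the result."""
    import ftfy  # third-party: pip install ftfy
    return url_quote_utf8(ftfy.fix_encoding(text))


# The correctly decoded name percent-encodes like this:
encoded = url_quote_utf8("Tha\u00eflande")  # 'Tha%C3%AFlande'
```

Encoding to UTF-8 explicitly before quoting keeps the behaviour the same on both Python versions, since Python 2's `urllib.quote` does not handle non-ASCII unicode strings reliably.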
