Normalize composite/decomposable/variable-length characters (unicode/python3.4)

Question

I stumbled upon http://mortoray.com/2013/11/27/the-string-type-is-broken/

And to my horror...

print(len('noe\u0308l')) # returns 5 not 4

However, I then found Normalizing Unicode:

import unicodedata
print(len(unicodedata.normalize('NFC', 'noe\u0308l'))) # returns 4

But what do I do with Schrödinger's cats?

print(len('😸😾')) # returns 4 not 2

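(Side question: in my text editor, when I try to save, I get "utf-8 codec can't encode character x in position y: surrogates not allowed", but at the command prompt I can paste and run code containing these characters. I assume this is because the cats exist on a different quantum level (the SMP), but how do I normalize them?)

Is there anything else I should do to make sure all characters count as "1"?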

Answer 1

To get a consistent code point count on any version of Python, encode to UTF-32 and divide the byte count by 4.

import unicodedata
print(len(unicodedata.normalize('NFC', 'noe\u0308l').encode('utf-32le')) // 4)
print(len('\U0001f638\U0001f63e'.encode('utf-32le')) // 4)
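The same idea can be wrapped in a small helper. This is only a sketch: the name `count_code_points` and its `form` parameter are made up here for illustration and do not come from any library. It normalizes first, then counts 4-byte UTF-32 units.

```python
import unicodedata


def count_code_points(text, form='NFC'):
    """Return the number of code points in text after normalization.

    Hypothetical helper for illustration; not part of any library.
    """
    normalized = unicodedata.normalize(form, text)
    # UTF-32 uses exactly 4 bytes per code point, and the explicit
    # little-endian codec does not prepend a byte order mark.
    return len(normalized.encode('utf-32-le')) // 4


print(count_code_points('noe\u0308l'))            # 4
print(count_code_points('\U0001f638\U0001f63e'))  # 2
```

Integer division keeps the result an int on Python 3. On Python 3.3 and later, `len()` on the normalized string should give the same number, so the UTF-32 round trip mainly matters on older narrow builds.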

Answer 2

Your editor is generating surrogate pairs rather than actual code points, which is also why you get that warning. Use:

'\U0001f638\U0001f63e'

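to define the cats without resorting to surrogates.

If you do have a string containing surrogates, you can re-encode it via UTF-16, using the 'surrogatepass' error handler to allow the surrogates to be encoded: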
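To make the difference concrete, here is a small sketch comparing the two spellings of the same cat. The variable names are illustrative only. The surrogate spelling is two code points and cannot be encoded as UTF-8, which matches the error the editor reports on save.

```python
real = '\U0001f638'      # one code point in the SMP
paired = '\ud83d\ude38'  # the same character spelled as a UTF-16 surrogate pair

print(len(real))    # 1
print(len(paired))  # 2

# The real code point encodes to four UTF-8 bytes.
print(len(real.encode('utf-8')))  # 4

# The surrogate spelling is rejected by the UTF-8 codec.
try:
    paired.encode('utf-8')
    utf8_ok = True
except UnicodeEncodeError as exc:
    utf8_ok = False
    print(exc.reason)  # surrogates not allowed
```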

>>> # \U0001f638 is \ud83d\ude38 when using UTF-16 surrogates
...
>>> '\ud83d\ude38'.encode('utf16', 'surrogatepass').decode('utf16')
'😸'
>>> len(_)
1

From the Error Handlers documentation:

'surrogateescape'
On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data. (See PEP 383 for more.)
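The UTF-16 round trip from the interactive session above can be sketched as a reusable function. The name `merge_surrogate_pairs` is hypothetical, and the sketch assumes Python 3.4 or later and that every surrogate in the input is part of a well-formed pair. A lone surrogate would make the decode step raise `UnicodeDecodeError`.

```python
def merge_surrogate_pairs(text):
    """Combine UTF-16 surrogate pairs in text into single code points.

    Hypothetical helper for illustration; assumes all surrogates are paired.
    """
    # 'surrogatepass' lets the encoder write surrogates out as raw UTF-16
    # code units; decoding those units joins each pair into one code point.
    return text.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')


cats = merge_surrogate_pairs('\ud83d\ude38\ud83d\ude3e')
print(len(cats))                        # 2
print(cats == '\U0001f638\U0001f63e')   # True
```

Strings without surrogates pass through unchanged, so it should be safe to apply to text that may or may not have come from an editor that emits surrogate pairs.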

