Correct len() for 32-bit unicode strings in Python

Question

I'm having a problem with 32-bit unicode strings in Python 2.7. A simple declaration such as:

s = u'\U0001f601'
print s

prints a nice smiley face in the shell (if the shell supports unicode). The problem comes when I try:

print len(s), s.encode('latin-1', errors='replace')

I get different results on different platforms. On Linux, I get:

1 ?

But on Mac, I get:

2 ??

Is the string declaration correct? Is this a bug in Python on Mac?

Answer 1

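Your OS X Python was compiled with UCS-2 (really UTF-16) support, whereas your Linux Python was compiled with UCS-4 support. This means that on OS X a surrogate pair, which has a length of 2 code units, is being used to represent the SMP (Supplementary Multilingual Plane) character. That is why len(s) returns 2 there and encode() with errors='replace' produces two question marks instead of one.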
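
To check which kind of build you have, and to get a length that is the same on both, something along these lines may help. This is only a sketch: is_narrow_build and codepoint_len are hypothetical helper names, not part of the standard library, and the approach simply counts a high surrogate followed by a low surrogate as a single character.

```python
import sys

def is_narrow_build():
    # Narrow (UCS-2/UTF-16) builds report sys.maxunicode == 0xFFFF;
    # wide (UCS-4) builds report 0x10FFFF.
    return sys.maxunicode == 0xFFFF

def codepoint_len(s):
    """Count code points, treating a high+low surrogate pair as one."""
    count = 0
    i = 0
    n = len(s)
    while i < n:
        is_high = 0xD800 <= ord(s[i]) <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF:
            i += 2  # skip both halves of the surrogate pair
        else:
            i += 1
        count += 1
    return count

s = u'\U0001f601'
print(is_narrow_build())  # depends on how the interpreter was compiled
print(codepoint_len(s))   # 1 on both narrow and wide builds
```

On a wide build the loop never sees a surrogate pair for this string, so the result matches len(s); on a narrow build it folds the pair back into one character.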


unicode

python-2.7