Python unicode indexing shows different character

Question

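I have a Unicode string containing a Unicode character in a "narrow" build of Python 2.7.10. I am trying to use that Unicode character as a lookup key in a dictionary, but when I index the string to get the last Unicode character, it returns a different string: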

>>> s = u'Python is fun \U0001f44d'
>>> s[-1]
u'\udc4d'

Why is this happening, and how do I retrieve '\U0001f44d' from the string?

Edit: unicodedata.unidata_version is 5.2.0 and sys.maxunicode is 65535.

Answer 1

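It looks like your Python 2 build uses surrogates to represent code points outside the Basic Multilingual Plane. See, for example, some background on this.

My advice is to switch to Python 3 as soon as possible for anything that involves string processing.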
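
To check which kind of build you are running, sys.maxunicode (the value quoted in the question's edit) is enough. A minimal sketch; the helper name is_narrow_build is my own and not part of any library:

```python
import sys

def is_narrow_build():
    """Return True on a Python 2 "narrow" build (hypothetical helper).

    Narrow builds report sys.maxunicode == 0xFFFF (65535).
    Wide builds and Python 3.3+ report 0x10FFFF (1114111).
    """
    return sys.maxunicode == 0xFFFF

print(is_narrow_build())
```

On the interpreter from the question this should print True, since its sys.maxunicode is 65535.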

Answer 2

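A Python 2 "narrow" build stores Unicode strings as UTF-16 (a so-called leaky abstraction), so a code point above U+FFFF is stored as two UTF-16 surrogates. To retrieve the code point, you have to take both the leading and the trailing surrogate: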

Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:25:58) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s = u'Python is fun \U0001f44d'
>>> s[-1]     # Just the trailing surrogate
u'\udc4d'
>>> s[-2:]    # leading and trailing
u'\U0001f44d'
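
If you are stuck on the narrow build, something along these lines may help with the dictionary lookup from the question. It is only a sketch: combine_surrogates, last_character and emoji_names are names I made up for illustration, and the code assumes well-formed input (it does not handle every lone-surrogate case).

```python
def combine_surrogates(lead, trail):
    """Return the code point (as an int) encoded by a UTF-16 surrogate pair."""
    return 0x10000 + ((ord(lead) - 0xD800) << 10) + (ord(trail) - 0xDC00)

def last_character(s):
    """Return the last character of s, keeping a surrogate pair together."""
    if (len(s) >= 2
            and u'\ud800' <= s[-2] <= u'\udbff'    # leading surrogate
            and u'\udc00' <= s[-1] <= u'\udfff'):  # trailing surrogate
        return s[-2:]
    return s[-1:]

# Hypothetical lookup table, keyed by the full character.
emoji_names = {u'\U0001f44d': 'thumbs up'}

s = u'Python is fun \U0001f44d'
print(emoji_names[last_character(s)])
```

On a narrow build last_character(s) returns the two-unit slice s[-2:], which compares equal to the u'\U0001f44d' key. On a wide build or Python 3 it returns the single code point, so the same lookup works either way. combine_surrogates(u'\ud83d', u'\udc4d') gives 0x1F44D, which shows where the u'\udc4d' in the question comes from.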

Switch to Python 3.3+ and the problem goes away, because the storage details of the code points in a Unicode string are no longer exposed:

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s = u'Python is fun \U0001f44d'
>>> s[-1]   # code points are stored in Unicode strings.
'\U0001f44d'
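
In Python 3 the surrogates only show up if you explicitly encode to UTF-16. A small sketch, assuming Python 3.5+ for bytes.hex():

```python
s = 'Python is fun \U0001f44d'
last = s[-1]

# Indexing works on code points, so the last element is one character.
print(len(last))

# Encoding to UTF-16 (big-endian, no BOM) reveals the surrogate pair
# D83D DC4D that the narrow Python 2 build exposed through indexing.
utf16 = last.encode('utf-16-be')
print(utf16.hex())
```

The trailing two bytes, DC4D, are the same u'\udc4d' that s[-1] returned on the narrow build.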

