How to find out internal string encoding?

Question

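I learned from PEP 393 that Python can use several encodings internally when storing strings: latin1, UCS-2, UCS-4. Is it possible to find out which encoding is used to store a particular string, e.g. in the interactive interpreter?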

Answer 1

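The only way to test this from the Python layer (without manually poking at object internals via ctypes or a Python extension module) is to check the ordinal value of the largest character in the string, which determines whether the string is stored as ASCII/latin-1, UCS-2 or UCS-4. A solution would be something like: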

def get_bpc(s):
    maxordinal = ord(max(s, default='\0'))
    if maxordinal < 256:
        return 1
    elif maxordinal < 65536:
        return 2
    else:
        return 4
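
As a quick check of the function above, the following sketch repeats get_bpc and runs it on a few sample strings. The samples are chosen here for illustration and are not part of the original answer; the non-ASCII ones are written as escapes so the snippet survives copy and paste.

```python
def get_bpc(s):
    # The storage width follows from the largest code point in the string;
    # default='\0' makes the empty string count as a 1-byte string.
    maxordinal = ord(max(s, default='\0'))
    if maxordinal < 256:
        return 1
    elif maxordinal < 65536:
        return 2
    else:
        return 4

# Illustrative inputs (assumptions): ASCII, BMP-only, non-BMP, empty.
samples = ['test', '\u0111\u017e\u0161', '\U0001F600', '']
for s in samples:
    print(ascii(s), get_bpc(s))
```

Note that this infers the width from the contents only; it never looks at the object itself.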

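You can't actually rely on sys.getsizeof, because for non-ASCII strings (even strings with one byte per character that fit in the latin-1 range), the string may or may not have populated its cached UTF-8 representation. Tricks like adding an extra character and comparing sizes can therefore show the size decreasing. It can also happen "at a distance", so you are not directly responsible for the existence of the cached UTF-8 form on the string you are checking. For example: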

>>> import sys
>>> e = 'é'
>>> sys.getsizeof(e)
74
>>> sys.getsizeof(e + 'a')
75
>>> class é: pass  # One of several ways to trigger creation/caching of UTF-8 form
>>> sys.getsizeof(e)
77  # !!! Grew three bytes even though it's the same variable
>>> sys.getsizeof(e + 'a')
75  # !!! Adding a character shrunk the string!

Answer 2

One way of finding out which exact internal encoding CPython uses for a specific unicode string is to peek into the actual (CPython) object.

According to PEP 393 (Specification section), all unicode string objects start with a PyASCIIObject:

typedef struct {
  PyObject_HEAD
  Py_ssize_t length;
  Py_hash_t hash;
  struct {
      unsigned int interned:2;
      unsigned int kind:2;
      unsigned int compact:1;
      unsigned int ascii:1;
      unsigned int ready:1;
  } state;
  wchar_t *wstr;
} PyASCIIObject;

The character size is stored in the kind bit-field, as described in the PEP as well as in the code comments in unicodeobject:

00 => str is not initialized (data are in wstr)
01 => 1 byte (Latin-1)
10 => 2 byte (UCS-2)
11 => 4 byte (UCS-4);

Once we have the address of the string from id(string), we can use the ctypes module to read the object's bytes (and the kind field):

import ctypes
mystr = "x"
first_byte = ctypes.c_uint8.from_address(id(mystr)).value

The offset from the start of the object to kind is PyObject_HEAD + Py_ssize_t length + Py_hash_t hash, which in turn is Py_ssize_t ob_refcnt + a pointer to ob_type + Py_ssize_t length + the size of another pointer for the hash type:

offset = 2 * ctypes.sizeof(ctypes.c_ssize_t) + 2 * ctypes.sizeof(ctypes.c_void_p)

(which is 32 on x64)
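
The same offset can be cross-checked by describing the leading fields as a ctypes structure and asking for its size. This is only a sketch: the class name UnicodeHeader is made up here, and the field list assumes the header layout quoted above on a default (non-free-threaded) CPython build.

```python
import ctypes

class UnicodeHeader(ctypes.Structure):
    # Hypothetical mirror of the fields that precede `state` in PyASCIIObject.
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),  # PyObject_HEAD, first member
        ("ob_type", ctypes.c_void_p),     # PyObject_HEAD, second member
        ("length", ctypes.c_ssize_t),
        ("hash", ctypes.c_ssize_t),       # Py_hash_t has the size of Py_ssize_t
    ]

# Size of everything before `state`, i.e. the offset of the bit-field.
state_offset = ctypes.sizeof(UnicodeHeader)

# The hand-written formula from the answer, for comparison.
manual_offset = 2 * ctypes.sizeof(ctypes.c_ssize_t) + 2 * ctypes.sizeof(ctypes.c_void_p)

print(state_offset, manual_offset)
```

Both numbers should agree, since every field is pointer-sized and no padding is inserted between them.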

Putting it all together:

import ctypes

def bytes_per_char(s):
    offset = 2 * ctypes.sizeof(ctypes.c_ssize_t) + 2 * ctypes.sizeof(ctypes.c_void_p)
    kind = ctypes.c_uint8.from_address(id(s) + offset).value >> 2 & 3
    size = {0: ctypes.sizeof(ctypes.c_wchar), 1: 1, 2: 2, 3: 4}
    return size[kind]

gives:

>>> bytes_per_char('test')
1
>>> bytes_per_char('đžš')
2
>>> bytes_per_char('\U0001F600')
4

Note that we have to handle the special case of kind == 0, because in that case the character type is exactly wchar_t (16 or 32 bits, depending on the platform).

Answer 3

There is a CPython C API function for getting the kind of a unicode object: PyUnicode_KIND.

If you have Cython and IPython¹, you can easily access that function:

In [1]: %load_ext cython
   ...:

In [2]: %%cython
   ...:
   ...: cdef extern from "Python.h":
   ...:     int PyUnicode_KIND(object o)
   ...:
   ...: cpdef unicode_kind(astring):
   ...:     if type(astring) is not str:
   ...:         raise TypeError('astring must be a string')
   ...:     return PyUnicode_KIND(astring)

In [3]: a = 'a'
   ...: b = 'Ǧ'
   ...: c = '\U0001F600'

In [4]: unicode_kind(a), unicode_kind(b), unicode_kind(c)
Out[4]: (1, 2, 4)

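Here 1 stands for latin-1, and 2 and 4 stand for UCS-2 and UCS-4 respectively.

You could then use a dictionary to map these numbers to a string that represents the encoding.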
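
A minimal sketch of such a mapping follows. The names KIND_TO_ENCODING, unicode_kind_py and internal_encoding are invented for this example, and unicode_kind_py is only a pure-Python stand-in for the Cython unicode_kind above (it infers the kind from the largest code point rather than calling the C API), so the snippet runs without Cython.

```python
# Kind values as reported by PyUnicode_KIND, mapped to readable names.
KIND_TO_ENCODING = {1: 'latin-1', 2: 'UCS-2', 4: 'UCS-4'}

def unicode_kind_py(s):
    # Hypothetical stand-in for the Cython function: derive the kind
    # from the largest code point instead of reading the object.
    maxordinal = ord(max(s, default='\0'))
    if maxordinal < 256:
        return 1
    if maxordinal < 65536:
        return 2
    return 4

def internal_encoding(s):
    # Look up the name for the kind of the given string.
    return KIND_TO_ENCODING[unicode_kind_py(s)]

for s in ('a', '\u01e6', '\U0001F600'):
    print(ascii(s), internal_encoding(s))
```

With the Cython version available, unicode_kind could be passed to the dictionary lookup directly in place of the stand-in.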

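¹It is also possible without Cython and/or IPython; the combination is just very handy. Otherwise it would be more code (without IPython) and/or require a manual installation (without Cython).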
