String character identity paradox

Question

I'm completely stuck with this:

>>> s = chr(8263)
>>> x = s[0]
>>> x is s[0]
False

How is this possible? Does it mean that accessing a string character by index creates a new instance of the same character? Let's experiment:

>>> L = [s[0] for _ in range(1000)]
>>> len(set(L))
1
>>> ids = map(id, L)
>>> len(set(ids))
1000
>>>

Yikes, what a waste of bytes ;) Or does it mean that str.__getitem__ has a hidden feature? Can somebody explain?

But this is not the end of my surprise:

>>> s = chr(8263)
>>> t = s
>>> print(t is s, id(t) == id(s))
True True

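This is clear enough: t is an alias for s, so they refer to the same object and their identities coincide. But again, how is the following possible: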

>>> print(t[0] is s[0])
False

s and t are the same object, so how can indexing them give different objects?

But it gets worse:

>>> print(id(t[0]) == id(s[0]))
True

t[0] and s[0] have not been garbage collected, are considered different objects by the is operator, and yet have the same id? Can somebody explain?

Answer 1

is compares identity, while == compares value. See the documentation:

Every object has an identity, a type and a value. An object’s identity never changes once it has been created; you may think of it as the object’s address in memory. The 'is' operator compares the identity of two objects; the id() function returns an integer representing its identity (currently implemented as its address). An object’s type is also unchangeable.
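
To make the distinction in the quoted passage concrete, here is a minimal sketch using lists, where identity behaviour is not affected by any string caching. The variable names are only for illustration.

```python
# Two distinct list objects that hold equal values.
a = [1, 2, 3]
b = [1, 2, 3]
c = a  # c is just another name for the object a refers to

values_equal = (a == b)   # compares contents
same_object = (a is b)    # compares identity
alias_same = (c is a)     # an alias shares the identity of the original

print(values_equal, same_object, alias_same)  # True False True
```

Equal values never imply equal identity; only the alias c passes the is test.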

Answer 2

There are two points to make here.

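First, Python does indeed create a new character with the __getitem__ call, but only if that character has an ordinal value of 256 or greater.

For example:

>>> s = chr(255)
>>> s[0] is s
True

>>> t = chr(257)
>>> t[0] is t
False

This is because internally, the compiled getitem function checks the ordinal value of the single character and calls get_latin1_char if that value is below 256, that is, within the Latin-1 range. This allows some single-character strings to be shared. Otherwise, a new unicode object is created.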
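
The sharing described above can be probed with a small sketch. It assumes CPython, since this is an implementation detail rather than a language guarantee, and index_returns_same_object is a hypothetical helper name introduced only for this illustration.

```python
import sys

def index_returns_same_object(code_point):
    """Hypothetical helper: is s[0] the very object s for a one-character string?"""
    s = chr(code_point)
    return s[0] is s

latin1_shared = index_returns_same_object(ord("A"))  # U+0041, inside Latin-1
non_latin1_shared = index_returns_same_object(8263)  # U+2047, outside Latin-1

print(sys.implementation.name, latin1_shared, non_latin1_shared)
```

On CPython the first result is expected to be True and the second False. Other interpreters may behave differently, which is one more reason not to rely on is for comparing strings.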

The second issue concerns garbage collection and shows that the interpreter can reuse memory addresses very quickly. When you write:

>>> s = t # = chr(257)
>>> t[0] is s[0]
False

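Python first creates two new single-character strings and then compares their memory addresses. They have different addresses (they are different objects, as explained above), so comparing the objects with is returns False.

On the other hand, we can run into the seemingly paradoxical situation that: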

>>> id(t[0]) == id(s[0])
True

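But this is because the interpreter quickly reuses the memory address of t[0] when it creates the new string s[0] a moment later.

If you examine the bytecode this line produces (e.g. with dis - see below), you will see that the address for each side is allocated one after the other (a new string object is created and then id is called on it).

As soon as id(t[0]) returns, the reference count of the object t[0] drops to zero (the comparison is now being done on integers, not on the objects themselves). This means that s[0] can reuse the same memory address when it is created afterwards.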
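
A rough way to see the role of object lifetime is to compare ids of temporaries with ids of objects that are kept alive by a name. This is a sketch under the assumption that the interpreter is CPython; address reuse for temporaries is common there but never guaranteed, so the first result is printed rather than relied upon.

```python
s = chr(8263)
t = s

# Temporaries: each one-character string may be freed as soon as id() returns,
# so the two ids can coincide (CPython often reuses the address; not guaranteed).
temporaries_match = id(t[0]) == id(s[0])

# Named references: both objects are alive at the same time, so their ids
# must differ whenever indexing produced two separate objects.
first = t[0]
second = s[0]
alive_match = id(first) == id(second)

print(temporaries_match, alive_match, first is second)
```

While both names are bound, comparing ids is equivalent to using is, which is why the paradox only appears with short-lived temporaries.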

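Here is the disassembled bytecode for the line id(t[0]) == id(s[0]), which I have annotated.

You can see that the lifetime of t[0] ends before s[0] is created (there are no references left to it), so its memory can be reused.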

  2           0 LOAD_GLOBAL              0 (id)
              3 LOAD_GLOBAL              1 (t)
              6 LOAD_CONST               1 (0)
              9 BINARY_SUBSCR                     # t[0] is created
             10 CALL_FUNCTION            1        # id(t[0]) is computed...
                                                  # ...lifetime of string t[0] over
             13 LOAD_GLOBAL              0 (id)
             16 LOAD_GLOBAL              2 (s)
             19 LOAD_CONST               1 (0)
             22 BINARY_SUBSCR                     # s[0] is created...
                                                  # ...free to reuse t[0] memory
             23 CALL_FUNCTION            1        # id(s[0]) is computed
             26 COMPARE_OP               2 (==)   # the two ids are compared
             29 RETURN_VALUE
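
For anyone who wants to reproduce a listing like the one above, here is a minimal sketch using the standard dis module. The exact opcode names and offsets depend on the Python version, so the output will likely not match the listing line for line (newer releases use different call and subscript instructions, for example).

```python
import dis

# Compile the expression on its own; t and s only need to exist at run time,
# not at compile time, so nothing has to be defined here.
code = compile("id(t[0]) == id(s[0])", "<example>", "eval")

instructions = list(dis.get_instructions(code))
for ins in instructions:
    print(ins.opname, ins.argrepr)
```

Whatever the version, the order of operations is the same: the left-hand subscript and its id call are finished before the right-hand subscript is evaluated.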
