How to create the str "1" at two different memory locations?

Question

We can defeat small-integer interning this way (a calculation lets us avoid the caching layer):

>>> n = 674039
>>> one1 = 1
>>> one2 = (n ** 9 + 1) % (n ** 9)
>>> one1 == one2
True
>>> one1 is one2
False
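For reference, the caching layer itself can be observed from plain Python. This is a minimal sketch that assumes CPython, where small integers come from a shared cache and larger ones do not; the variable names are illustrative, and the values are parsed at runtime so the compiler cannot merge equal constants.

```python
# Build the values at runtime so the compiler cannot merge equal constants.
small_a = int("1")
small_b = int("1")
big_a = int("1000000")
big_b = int("1000000")

print(small_a is small_b)            # the cached object is shared on CPython
print(big_a == big_b, big_a is big_b)  # equal values, but separate objects on CPython
```

The exact bounds of the cache are an implementation detail and should not be relied upon.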

How can we defeat small-string interning, i.e. see the following result:

>>> one1 = "1"
>>> one2 = <???>
>>> type(one2) is str and one1 == one2
True
>>> one1 is one2
False

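sys.intern mentions that "Interned strings are not immortal", but there is no context on how a string could get kicked out of the intern table, or on how to create a str instance that avoids the caching layer.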

Since interning is a CPython implementation detail, answers that rely on undocumented implementation details are OK/expected.

Answer 1

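Unicode strings consisting of only one character (with a value smaller than 128, or more precisely from latin1) are the most complicated case, because these strings aren't really interned. Instead (more like the integer pool, or identical to the behavior for bytes) they are created at the start and stored in an array for as long as the interpreter is alive: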

struct _Py_unicode_state {
    ...
    /* Single character Unicode strings in the Latin-1 range are being
       shared as well. */
    PyObject *latin1[256];
    ...
    /* This dictionary holds all interned unicode strings...
    */
    PyObject *interned;
    ...
};

So every time a unicode of length 1 is created, its character value is checked against the latin1 array. E.g. in unicode_decode_utf8:

    /* ASCII is equivalent to the first 128 ordinals in Unicode. */
    if (size == 1 && (unsigned char)s[0] < 128) {
        if (consumed) {
            *consumed = 1;
        }
        return get_latin1_char((unsigned char)s[0]);
    }
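The effect of this lookup is visible from Python. The following is a small sketch, assuming CPython, in which three different runtime routes to the one-character string "1" all end up at the same shared object; the variable names are illustrative only.

```python
# Three runtime routes to a one-character ASCII string.
from_chr = chr(49)                    # code point 49 is "1"
from_decode = b"1".decode("utf-8")    # goes through the UTF-8 decoder
from_index = "123"[0]                 # indexing a longer string

print(from_chr == from_decode == from_index)
print(from_chr is from_decode, from_decode is from_index)
```

This is why the ordinary ways of producing "1" cannot yield a second object.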

One could even argue that if there were a way to circumvent this from within the interpreter, we would be talking about a (performance) bug.

One possibility is to populate the unicode data ourselves using the C-API. I use Cython for the proof of concept, but ctypes could also be used to the same effect:

%%cython
cdef extern from *:
    """
    PyObject* create_new_unicode(char *ch) 
    {
       PyUnicodeObject *ob = (PyUnicodeObject *)PyUnicode_New(1, 127);
       Py_UCS1 *data = PyUnicode_1BYTE_DATA(ob);
       data[0]=ch[0]; //fill data without using the unicode_decode_utf8
       return (PyObject*)ob;
    }
    """
    object create_new_unicode(char *ch)
    
def gen1():
    return create_new_unicode(b"1")

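Noteworthy details:

PyUnicode_New doesn't look up in latin1, because the characters aren't set yet.
For simplicity, the above works only for ASCII characters, thus we pass 127 as maxchar to PyUnicode_New. As a result, we can interpret the data via PyUnicode_1BYTE_DATA, which makes it easy to manipulate without much manual work.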

And now:

a,b=gen1(), gen1()
a is b, a == b
# yields (False, True)

as wanted.

Here is a similar idea, but implemented with ctypes:

from ctypes import py_object, c_ssize_t, pythonapi

# PyUnicode_New(size, maxchar) allocates a fresh str whose data is not yet filled in.
PyUnicode_New = pythonapi.PyUnicode_New
PyUnicode_New.argtypes = (c_ssize_t, c_ssize_t)
PyUnicode_New.restype = py_object

# Private helper; it may not be exported by every CPython version.
PyUnicode_CopyCharacters = pythonapi._PyUnicode_FastCopyCharacters
PyUnicode_CopyCharacters.argtypes = (py_object, c_ssize_t, py_object, c_ssize_t, c_ssize_t)
PyUnicode_CopyCharacters.restype = None  # the C function returns void

def clone(orig):
    cloned = PyUnicode_New(1, 127)  # one character, maxchar 127 (ASCII)
    PyUnicode_CopyCharacters(cloned, 0, orig, 0, 1)  # copy orig[0] into cloned[0]
    return cloned

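Noteworthy details:

It is not possible to use PyUnicode_1BYTE_DATA with ctypes, because it is a macro. An alternative would be to calculate the offset of the data member and access that memory directly (but this depends on the platform and doesn't feel very portable).
As a workaround, PyUnicode_CopyCharacters is used (there are probably other ways to achieve the same), which is more abstract and portable than calculating/accessing the memory directly.
Actually, _PyUnicode_FastCopyCharacters is used, because PyUnicode_CopyCharacters checks whether the target unicode has multiple references and throws. _PyUnicode_FastCopyCharacters doesn't perform those checks and does as asked.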
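Because the helper's name starts with an underscore, it is private, and a given interpreter build may not export it. A possible guard, sketched here with a hypothetical helper name `has_capi_symbol`, is to check for the symbol before binding it; ctypes resolves symbols on attribute access and raises AttributeError when the lookup fails.

```python
import ctypes

def has_capi_symbol(name):
    """Return True if this interpreter exports the C symbol `name`."""
    # hasattr() swallows the AttributeError that ctypes raises
    # when the symbol cannot be resolved.
    return hasattr(ctypes.pythonapi, name)

print(has_capi_symbol("PyUnicode_New"))
print(has_capi_symbol("_PyUnicode_FastCopyCharacters"))
```

If the second check prints False, the ctypes approach cannot be used as written on that interpreter.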

And now:

a="1"
b=clone(a)
a is b, a==b
# yields (False, True)

For strings longer than 1 character, it is much easier to avoid interning, e.g.:

a="12"
b="123"[0:2]
a is b, a == b
#yields (False, True)
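The relationship to sys.intern can be sketched the same way. This assumes CPython; the names `canonical` and `fresh` are illustrative, and the second string is built with str.join at runtime so that it is a new, non-interned object.

```python
import sys

canonical = sys.intern("12")     # the interned "12"
fresh = "".join(["1", "2"])      # built at runtime, not interned

print(fresh == canonical)
print(fresh is canonical)
print(sys.intern(fresh) is canonical)
```

Passing the fresh string through sys.intern returns the object already stored in the intern table, so identity is restored only on request.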
