为什么 str.replace 对于单个异常值要慢得多?

Why is str.replace so much slower with a single outlier?

我测试了 s.replace('a', ''),其中 s 是一个包含 200 万 'a' 的字符串,并且可能在开头、中间或结尾有一个异常值 'b'。那个异常值使它慢得多:

At TIO 及其 Python 3.8 预发布版本:

a = 'a' * 10**6; s = f'{a}{a}'      5 ms    5 ms    5 ms 
a = 'a' * 10**6; s = f'b{a}{a}'    24 ms   24 ms   24 ms 
a = 'a' * 10**6; s = f'{a}b{a}'    25 ms   25 ms   25 ms 
a = 'a' * 10**6; s = f'{a}{a}b'    25 ms   25 ms   25 ms 
a = 'a' * 10**6; s = f'b{a}b{a}b'  25 ms   25 ms   25 ms 

在装有 Python 3.10 的旧笔记本电脑上:

a = 'a' * 10**6; s = f'{a}{a}'      4 ms    4 ms    4 ms 
a = 'a' * 10**6; s = f'b{a}{a}'    93 ms   94 ms   95 ms 
a = 'a' * 10**6; s = f'{a}b{a}'    94 ms   95 ms   95 ms 
a = 'a' * 10**6; s = f'{a}{a}b'    94 ms   94 ms   95 ms 
a = 'a' * 10**6; s = f'b{a}b{a}b'  95 ms   95 ms   96 ms

那个异常值 'b' 是如何让它变慢的?

完整基准代码:

from timeit import repeat

setups = [
    "a = 'a' * 10**6; s = f'{a}{a}'",
    "a = 'a' * 10**6; s = f'b{a}{a}'",
    "a = 'a' * 10**6; s = f'{a}b{a}'",
    "a = 'a' * 10**6; s = f'{a}{a}b'",
    "a = 'a' * 10**6; s = f'b{a}b{a}b'",
]

for _ in range(3):
    for setup in setups:
        times = repeat("s.replace('a', '')", setup, number=1)
        print(f'{setup:{max(map(len, setups))}}',
              *('%3d ms ' % (t * 1e3) for t in sorted(times)[:3]))
    print()

str.replace() 在 C 函数 replace() 中的 https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L10604 处实现。这是该函数的摘录。它表明,如果 returned 字符串 (new_size) 的大小将为 0,我们会提前停止并 return 一个空字符串。否则,我们将执行循环遍历输入字符串并执行替换的漫长任务 one-by-one.

new_size = slen + n * (len2 - len1);
if (new_size == 0) {
    u = unicode_new_empty();
    goto done;
}
if (new_size > (PY_SSIZE_T_MAX / rkind)) {
    PyErr_SetString(PyExc_OverflowError,
                    "replace string is too long");
    goto error;
}
u = PyUnicode_New(new_size, maxchar);
if (!u)
    goto error;
assert(PyUnicode_KIND(u) == rkind);
res = PyUnicode_DATA(u);
ires = i = 0;
if (len1 > 0) {
    while (n-- > 0) {
        /* look for next match */
        j = anylib_find(rkind, self,
                        sbuf + rkind * i, slen-i,
                        str1, buf1, len1, i);
        if (j == -1)
            break;
        else if (j > i) {
            /* copy unchanged part [i:j] */
            memcpy(res + rkind * ires,
                   sbuf + rkind * i,
                   rkind * (j-i));
            ires += j - i;
        }
        /* copy substitution string */
        if (len2 > 0) {
            memcpy(res + rkind * ires,
                   buf2,
                   rkind * len2);
            ires += len2;
        }
        i = j + len1;
    }
    if (i < slen)
        /* copy tail [i:] */
        memcpy(res + rkind * ires,
               sbuf + rkind * i,
               rkind * (slen-i));
}