为什么 str.replace 对于单个异常值要慢得多?
Why is str.replace so much slower with a single outlier?
我测试了 s.replace('a', '')
,其中 s
是一个包含 200 万 'a'
的字符串,并且可能在开头、中间或结尾有一个异常值 'b'
。那个异常值使它慢得多:
At TIO 及其 Python 3.8 预发布版本:
a = 'a' * 10**6; s = f'{a}{a}' 5 ms 5 ms 5 ms
a = 'a' * 10**6; s = f'b{a}{a}' 24 ms 24 ms 24 ms
a = 'a' * 10**6; s = f'{a}b{a}' 25 ms 25 ms 25 ms
a = 'a' * 10**6; s = f'{a}{a}b' 25 ms 25 ms 25 ms
a = 'a' * 10**6; s = f'b{a}b{a}b' 25 ms 25 ms 25 ms
在装有 Python 3.10 的旧笔记本电脑上:
a = 'a' * 10**6; s = f'{a}{a}' 4 ms 4 ms 4 ms
a = 'a' * 10**6; s = f'b{a}{a}' 93 ms 94 ms 95 ms
a = 'a' * 10**6; s = f'{a}b{a}' 94 ms 95 ms 95 ms
a = 'a' * 10**6; s = f'{a}{a}b' 94 ms 94 ms 95 ms
a = 'a' * 10**6; s = f'b{a}b{a}b' 95 ms 95 ms 96 ms
那个异常值 'b
' 是如何让它变慢的?
完整基准代码:
from timeit import repeat
setups = [
"a = 'a' * 10**6; s = f'{a}{a}'",
"a = 'a' * 10**6; s = f'b{a}{a}'",
"a = 'a' * 10**6; s = f'{a}b{a}'",
"a = 'a' * 10**6; s = f'{a}{a}b'",
"a = 'a' * 10**6; s = f'b{a}b{a}b'",
]
for _ in range(3):
for setup in setups:
times = repeat("s.replace('a', '')", setup, number=1)
print(f'{setup:{max(map(len, setups))}}',
*('%3d ms ' % (t * 1e3) for t in sorted(times)[:3]))
print()
str.replace()
在 C 函数 replace()
中的 https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L10604 处实现。这是该函数的摘录。它表明,如果 returned 字符串 (new_size
) 的大小将为 0,我们会提前停止并 return 一个空字符串。否则,我们将执行循环遍历输入字符串并执行替换的漫长任务 one-by-one.
new_size = slen + n * (len2 - len1);
if (new_size == 0) {
u = unicode_new_empty();
goto done;
}
if (new_size > (PY_SSIZE_T_MAX / rkind)) {
PyErr_SetString(PyExc_OverflowError,
"replace string is too long");
goto error;
}
u = PyUnicode_New(new_size, maxchar);
if (!u)
goto error;
assert(PyUnicode_KIND(u) == rkind);
res = PyUnicode_DATA(u);
ires = i = 0;
if (len1 > 0) {
while (n-- > 0) {
/* look for next match */
j = anylib_find(rkind, self,
sbuf + rkind * i, slen-i,
str1, buf1, len1, i);
if (j == -1)
break;
else if (j > i) {
/* copy unchanged part [i:j] */
memcpy(res + rkind * ires,
sbuf + rkind * i,
rkind * (j-i));
ires += j - i;
}
/* copy substitution string */
if (len2 > 0) {
memcpy(res + rkind * ires,
buf2,
rkind * len2);
ires += len2;
}
i = j + len1;
}
if (i < slen)
/* copy tail [i:] */
memcpy(res + rkind * ires,
sbuf + rkind * i,
rkind * (slen-i));
}
我测试了 s.replace('a', '')
,其中 s
是一个包含 200 万 'a'
的字符串,并且可能在开头、中间或结尾有一个异常值 'b'
。那个异常值使它慢得多:
At TIO 及其 Python 3.8 预发布版本:
a = 'a' * 10**6; s = f'{a}{a}' 5 ms 5 ms 5 ms
a = 'a' * 10**6; s = f'b{a}{a}' 24 ms 24 ms 24 ms
a = 'a' * 10**6; s = f'{a}b{a}' 25 ms 25 ms 25 ms
a = 'a' * 10**6; s = f'{a}{a}b' 25 ms 25 ms 25 ms
a = 'a' * 10**6; s = f'b{a}b{a}b' 25 ms 25 ms 25 ms
在装有 Python 3.10 的旧笔记本电脑上:
a = 'a' * 10**6; s = f'{a}{a}' 4 ms 4 ms 4 ms
a = 'a' * 10**6; s = f'b{a}{a}' 93 ms 94 ms 95 ms
a = 'a' * 10**6; s = f'{a}b{a}' 94 ms 95 ms 95 ms
a = 'a' * 10**6; s = f'{a}{a}b' 94 ms 94 ms 95 ms
a = 'a' * 10**6; s = f'b{a}b{a}b' 95 ms 95 ms 96 ms
那个异常值 'b
' 是如何让它变慢的?
完整基准代码:
from timeit import repeat
setups = [
"a = 'a' * 10**6; s = f'{a}{a}'",
"a = 'a' * 10**6; s = f'b{a}{a}'",
"a = 'a' * 10**6; s = f'{a}b{a}'",
"a = 'a' * 10**6; s = f'{a}{a}b'",
"a = 'a' * 10**6; s = f'b{a}b{a}b'",
]
for _ in range(3):
for setup in setups:
times = repeat("s.replace('a', '')", setup, number=1)
print(f'{setup:{max(map(len, setups))}}',
*('%3d ms ' % (t * 1e3) for t in sorted(times)[:3]))
print()
str.replace()
在 C 函数 replace()
中的 https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L10604 处实现。这是该函数的摘录。它表明,如果 returned 字符串 (new_size
) 的大小将为 0,我们会提前停止并 return 一个空字符串。否则,我们将执行循环遍历输入字符串并执行替换的漫长任务 one-by-one.
new_size = slen + n * (len2 - len1);
if (new_size == 0) {
u = unicode_new_empty();
goto done;
}
if (new_size > (PY_SSIZE_T_MAX / rkind)) {
PyErr_SetString(PyExc_OverflowError,
"replace string is too long");
goto error;
}
u = PyUnicode_New(new_size, maxchar);
if (!u)
goto error;
assert(PyUnicode_KIND(u) == rkind);
res = PyUnicode_DATA(u);
ires = i = 0;
if (len1 > 0) {
while (n-- > 0) {
/* look for next match */
j = anylib_find(rkind, self,
sbuf + rkind * i, slen-i,
str1, buf1, len1, i);
if (j == -1)
break;
else if (j > i) {
/* copy unchanged part [i:j] */
memcpy(res + rkind * ires,
sbuf + rkind * i,
rkind * (j-i));
ires += j - i;
}
/* copy substitution string */
if (len2 > 0) {
memcpy(res + rkind * ires,
buf2,
rkind * len2);
ires += len2;
}
i = j + len1;
}
if (i < slen)
/* copy tail [i:] */
memcpy(res + rkind * ires,
sbuf + rkind * i,
rkind * (slen-i));
}