Why is Python dict update insanely slow?

I have a Python program that reads lines from a file and puts them into a dict. Simplified, it looks like this:

data = {'file_name':''}
with open('file_name') as in_fd:
    for line in in_fd:
        data['file_name'] += line



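A second version appends each line to a list and joins once at the end:

data = {'file_name': []}
with open('file_name') as in_fd:
    for line in in_fd:
        data['file_name'].append(line)  # collect the lines first
data['file_name'] = ''.join(data['file_name'])  # join once, after the loop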





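I then timed four variants:

import time
LOOPS = 10000
WORD = 'ABC'*100

s1 = time.time()
buf1 = []
for i in xrange(LOOPS):
    buf1.append(WORD)
ss = ''.join(buf1)

s2 = time.time()
buf2 = ''
for i in xrange(LOOPS):
    buf2 += WORD

s3 = time.time()
buf3 = {'1':''}
for i in xrange(LOOPS):
    buf3['1'] += WORD

s4 = time.time()
buf4 = {'1':[]}
for i in xrange(LOOPS):
    buf4['1'].append(WORD)
buf4['1'] = ''.join(buf4['1'])

s5 = time.time()
print s2-s1, s3-s2, s4-s3, s5-s4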

On my laptop (mac pro 2013 mid, OS X 10.9.5, CPython 2.7.10), the output is:

0.00299620628357 0.00415587425232 3.49465799332 0.00231599807739


If I keep an extra reference to the string in the second loop:

trivial_reference = []
buf2 = ''
for i in xrange(LOOPS):
    buf2 += WORD
    trivial_reference.append(buf2)  # add a trivial reference to avoid optimization

After this change, the second loop takes 19 seconds to complete. So it seems to be just an optimization issue, as juanpa.arrivillaga said.

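+= performs very badly when building large strings, but it can be efficient in one particular case in CPython, described below.

For reliably fast string concatenation, use str.join().

From the String Concatenation section under Python Performance Tips:

Avoid this: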


s = ""
for substring in list:
    s += substring

Use s = "".join(list) instead. The former is a very common and catastrophic mistake when building large strings.

Why is s += x faster than s['1'] += x or s[0] += x?

From Note 6:

CPython implementation detail: If s and t are both strings, some Python implementations such as CPython can usually perform an in-place optimization for assignments of the form s = s + t or s += t. When applicable, this optimization makes quadratic run-time much less likely. This optimization is both version and implementation dependent. For performance sensitive code, it is preferable to use the str.join() method which assures consistent linear concatenation performance across versions and implementations.

CPython's optimization is that if a string has only one reference, it can be resized in place.

/* Note that we don't have to modify *unicode for unshared Unicode objects, since we can modify them in-place. */
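The "only one reference" condition can be made visible with sys.getrefcount. The sketch below is written in Python 3 syntax so it runs on current interpreters; the function name refcount_delta_when_stored is made up for this illustration, and the absolute counts it reads are interpreter-specific, so only the difference is reported.

```python
import sys

def refcount_delta_when_stored():
    # Build a fresh string at run time so it is not shared with a constant.
    s = ''.join(['ABC'] * 100)
    before = sys.getrefcount(s)   # absolute value varies between interpreters
    holder = {'1': s}             # the dict now holds one more reference
    after = sys.getrefcount(s)
    return after - before

# Storing the string in a container adds a reference, so the string
# is no longer "unshared" while the container still points at it.
print(refcount_delta_when_stored())
```

This only shows how the extra reference appears; it does not measure whether a given CPython version actually resizes in place.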


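The latter two forms are not simple in-place additions. The statement

s[0] += x

is roughly equivalent to:

temp = s[0]  # Extra reference: `s[0]` and `temp` both point to the same string now.
temp += x
s[0] = temp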


Example, showing that the old value of lst[0] is read before the right-hand side is evaluated:

>>> lst = [1, 2, 3]
>>> def func():
...     lst[0] = 90
...     return 100
>>> lst[0] += func()
>>> print lst
[101, 2, 3]  # Not [190, 2, 3]
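The read-then-write sequence behind s[0] += x can also be observed directly. The following is a small sketch in Python 3 syntax; LoggingList is a hypothetical helper class invented for this illustration, not part of any library.

```python
class LoggingList(list):
    """Hypothetical list subclass that records item reads and writes."""

    def __init__(self, *args):
        super().__init__(*args)
        self.log = []

    def __getitem__(self, index):
        self.log.append(('get', index))
        return super().__getitem__(index)

    def __setitem__(self, index, value):
        self.log.append(('set', index))
        super().__setitem__(index, value)

s = LoggingList(['abc'])
s[0] += 'def'     # one item read, then one item write
print(s.log)      # [('get', 0), ('set', 0)]
print(list(s))    # ['abcdef']
```

Between the 'get' and the 'set', the container and the temporary both refer to the old string, which is the extra reference described above.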

But in general, never use s += x to concatenate strings; always use str.join on a collection of strings.
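Applied to the question's use case, one way to follow this advice is to collect lines per key in a list and join once at the end. This is a minimal sketch in Python 3 syntax; read_streams is a hypothetical helper, and io.StringIO stands in for a real file so the example is self-contained.

```python
import io
from collections import defaultdict

def read_streams(streams):
    """Hypothetical helper: `streams` maps a name to an iterable of lines."""
    parts = defaultdict(list)
    for name, lines in streams.items():
        for line in lines:
            parts[name].append(line)      # cheap append, no string copying
    # One join per key, after all lines have been collected.
    return {name: ''.join(chunks) for name, chunks in parts.items()}

demo = {'file_name': io.StringIO('first\nsecond\nthird\n')}
data = read_streams(demo)
print(repr(data['file_name']))   # 'first\nsecond\nthird\n'
```

With a real file, the StringIO object would be replaced by the handle returned from open().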


LOOPS = 1000
WORD = 'ABC'*100

def list_append():
    buf1 = [WORD for _ in xrange(LOOPS)]
    return ''.join(buf1)

def str_concat():
    buf2 = ''
    for i in xrange(LOOPS):
        buf2 += WORD
    return buf2

def dict_val_concat():
    buf3 = {'1': ''}
    for i in xrange(LOOPS):
        buf3['1'] += WORD
    return buf3['1']

def list_val_concat():
    buf4 = ['']
    for i in xrange(LOOPS):
        buf4[0] += WORD
    return buf4[0]

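def val_pop_concat():
    buf5 = ['']
    for i in xrange(LOOPS):
        val = buf5.pop()   # the list drops its reference to the string
        val += WORD
        buf5.append(val)   # put the string back for the next iteration
    return buf5[0]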

def val_assign_concat():
    buf6 = ['']
    for i in xrange(LOOPS):
        val = buf6[0]
        val += WORD
        buf6[0] = val
    return buf6[0]

>>> %timeit list_append()
1000 loops, best of 3: 1.31 ms per loop
>>> %timeit str_concat()
100 loops, best of 3: 3.09 ms per loop
>>> %run so.py
>>> %timeit list_append()
10000 loops, best of 3: 71.2 us per loop
>>> %timeit str_concat()
1000 loops, best of 3: 276 us per loop
>>> %timeit dict_val_concat()
100 loops, best of 3: 9.66 ms per loop
>>> %timeit list_val_concat()
100 loops, best of 3: 9.64 ms per loop
>>> %timeit val_pop_concat()
1000 loops, best of 3: 556 us per loop
>>> %timeit val_assign_concat()
100 loops, best of 3: 9.31 ms per loop

val_pop_concat is fast here because pop() removes the list's reference to the string, so CPython can resize it in place (as was correctly guessed).
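The same pop-then-store idea can be sketched for the dict case from the question. This is written in Python 3 syntax; dict_pop_concat and its parameters are hypothetical names for this illustration. Whether the in-place resize actually happens depends on the CPython version, so the example only checks the result, not the speed.

```python
def dict_pop_concat(loops=1000, word='ABC' * 100):
    buf = {'1': ''}
    for _ in range(loops):
        val = buf.pop('1')   # the dict drops its reference to the string
        val += word          # `val` is now the only name bound to it
        buf['1'] = val       # store the grown string back
    return buf['1']

result = dict_pop_concat(loops=3, word='ab')
print(result)   # ababab
```

This is a workaround rather than a recommendation: appending to a list and calling str.join remains the approach that does not rely on an implementation detail.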