Return 迭代器 vs Return 整个列表 Python？

Question

我测试了一些代码以了解哪个有效，返回迭代器并返回整个列表。

该程序是关于读取 .txt 文件的所有行（非常大）并创建字数统计词典（Python3.4）。

1.Returning迭代器

from collections import defaultdict
import time

def create_word_cnt_dict(line_iter):
    doc_vector = defaultdict(int)
    for line in line_iter:
        for word in line.split():
            doc_vector[word] += 1
    return dict(doc_vector)

def read_doc(doc_file):
    with open(doc_file) as f :
        while True:
            line = f.readline()
            if not line:
                break
            yield line

t0 = time.time()
line_iter = read_doc("./doc1.txt")
doc_vector = create_word_cnt_dict(line_iter)
t1 = time.time()
print(t1-t0)

需要，3.765739917755127

2.Returning 整个列表

from collections import defaultdict
import time

def create_word_cnt_dict(line_list):
    doc_vector = defaultdict(int)
    for line in line_list:
        for word in line.split():
            doc_vector[word] += 1
    return dict(doc_vector)

def read_doc1(doc_file):
    with open(doc_file) as f :
        lines = f.readlines()
        return lines

t0 = time.time()
lines = read_doc1("./doc1.txt")
doc_vector = create_word_cnt_dict(lines)
t1 = time.time()
print(t1-t0)

需要，3.6890149116516113

如您所见，返回整个列表要快得多。

但在内存使用方面，返回迭代器比返回整个列表更有效。

在书 Effective Python 中，建议返回迭代器以提高内存使用率。但我认为现在时间复杂度比 space 复杂度更重要，因为今天的计算机有足够的内存。

请给我一些建议。

Answer 1

在这种情况下，我认为您对 "much faster" 的解释与我的不同。 . .时间差异大约有百分之几，这不是很大（除非您的程序运行小时，否则用户可能不会注意到，然后差异微不足道。）

再加上迭代器给你更多的灵活性。如果你想在处理某行时停止读取行怎么办？在那种情况下，迭代器可能会快 2 倍或更多倍，因为您已经获得了 "short circuit".

的能力

出于短路和记忆的原因，我更喜欢这里的生成器功能。

^{另请注意，您正在阅读文件的时间可能会有所偏差。 readlines 可能会更有效率，因为 python 可以读取比正常情况下更大的块中的文件，这意味着对 OS 的调用更少。许多其他应用程序不会有这种微妙之处...}

Answer 2

视情况而定。

如果我们讨论的是相对少量的数据，那么时间复杂度也不会有什么不同。

考虑大量数据，我不是在谈论 Gbs 或 TBs，像 Google 和 Facebook 这样的大公司每天需要处理的更大的数据集，你认为 space complexity 不像 time complexity 那样吗？

Space我们显然不是在谈论存储内存而是RAM。

所以你的问题很宽泛，它取决于应用程序、你将要使用的数据量和你的要求。对于相对较小的数据集，我认为时间复杂度和 space 复杂度都不是什么大问题。

Answer 3

性能差异其实很小。

鉴于此，优秀的程序员会选择生成器版本，因为它很健壮。

如果你吞下整个文件，你就是在设下一个陷阱。在未来的某个时候，有人（也许是你）会尝试传入 1GB 或 10GB，他们会被搞砸，然后运行大骂 "WHY??????"

Return 迭代器 vs Return 整个列表 Python？

Return iterator vs Return whole list in Python?

python

performance

time-complexity

space-complexity