Generator comprehension with open function
I'm trying to figure out the best way to use generators when parsing a file line by line.
Which use of a generator comprehension would be better?
Option one:
with open('some_file') as file:
    lines = (line for line in file)
Option two:
lines = (line for line in open('some_file'))
I know they produce the same result, but which one is faster / more efficient?
You can't combine a generator expression and a context manager (a with statement) like that.
Generators are lazy. They don't actually read their source data until someone requests an item from them.
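As a quick illustration (a minimal sketch of my own, not part of the original question), a generator with a visible side effect shows that nothing runs until an item is requested:

```python
def noisy():
    # nothing in this body executes until next() is called on the generator
    print("producing 1")
    yield 1
    print("producing 2")
    yield 2

gen = noisy()       # creates the generator; prints nothing yet
value = next(gen)   # only now does "producing 1" print; value is 1
```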
This appears to work:
with open('some_file') as file:
    lines = (line for line in file)
But when you actually try to read a line later in your program:
for line in lines:
    print(line)
it will fail with ValueError: I/O operation on closed file.
That's because the context manager has already closed the file - that's its whole purpose in life - and the generator hadn't started reading from it before the for loop began actually requesting data.
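You can reproduce the failure without touching the filesystem; here is a minimal sketch using io.StringIO as a stand-in for the real file object:

```python
import io

buf = io.StringIO("first line\nsecond line\n")
lines = (line for line in buf)  # lazy: no data has been read yet
buf.close()                     # stands in for the with block ending

try:
    next(lines)                 # the first actual read attempt
except ValueError as err:
    print(err)                  # I/O operation on closed file
```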
Your second suggestion
lines = (line for line in open('some_file'))
has the opposite problem. You open() the file, but unless you close() it manually (which you can't, because you don't have a reference to the file handle), it will stay open forever. That is exactly the problem context managers solve.
In general, if you want to read the file, you can just... read the file:
with open('some_file') as file:
    lines = list(file)
Or you can use a real generator:
def lazy_reader(*args, **kwargs):
    with open(*args, **kwargs) as file:
        yield from file
Then you can do:
for line in lazy_reader('some_file', encoding="utf8"):
    print(line)
and lazy_reader() will close the file once the last line has been read.
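One caveat worth adding (my note, not from the original answer): if you stop iterating early, the file stays open while the generator sits suspended; calling close() on the generator (or letting it be garbage-collected) runs the with block's exit and closes the file. A small sketch using a throwaway temp file:

```python
import os
import tempfile

def lazy_reader(*args, **kwargs):
    with open(*args, **kwargs) as file:
        yield from file

# create a small throwaway file to read from
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, "w", encoding="utf8") as f:
    f.write("first\nsecond\nthird\n")

reader = lazy_reader(path, encoding="utf8")
first = next(reader)  # generator is now suspended inside its with block
reader.close()        # raises GeneratorExit at the yield; the with block closes the file

os.remove(path)
```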
If you want to benchmark things like this, I suggest you look at the timeit module.
Let's set up working versions of your two tests, and I'll add a few extra options that perform equivalently.
Here are several options:
def test1(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return [line for line in file_in]

def test2(file_path):
    return [line for line in open(file_path, "r", encoding="utf-8")]

def test3(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return file_in.readlines()

def test4(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return list(file_in)

def test5(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        yield from file_in
Let's test them with a text file containing 10 copies of the complete works of Shakespeare, which I happen to have around for exactly this kind of test.
If I do:
print(test1('shakespeare2.txt') == test2('shakespeare2.txt'))
print(test1('shakespeare2.txt') == test3('shakespeare2.txt'))
print(test1('shakespeare2.txt') == test4('shakespeare2.txt'))
print(test1('shakespeare2.txt') == list(test5('shakespeare2.txt')))
I see that all the tests produce identical results.
Now let's time them:
import timeit

setup = '''
file_path = "shakespeare2.txt"

def test1(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return [line for line in file_in]

def test2(file_path):
    return [line for line in open(file_path, "r", encoding="utf-8")]

def test3(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return file_in.readlines()

def test4(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return list(file_in)

def test5(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        yield from file_in
'''
print(timeit.timeit("test1(file_path)", setup=setup, number=100))
print(timeit.timeit("test2(file_path)", setup=setup, number=100))
print(timeit.timeit("test3(file_path)", setup=setup, number=100))
print(timeit.timeit("test4(file_path)", setup=setup, number=100))
print(timeit.timeit("list(test5(file_path))", setup=setup, number=100))
On my laptop this shows:
9.65
9.79
9.29
9.08
9.85
To me, that says the choice doesn't matter from a performance standpoint. So don't use your test2() strategy :-)
Note that test5() (credit to @tomalak) can matter a lot from a memory-management standpoint, though!
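To see why, compare how much memory the two approaches hold at once. This is a rough sketch of my own using sys.getsizeof on a generated throwaway file (exact numbers will vary by machine and Python version):

```python
import os
import sys
import tempfile

def lazy_reader(path):
    with open(path, encoding="utf8") as file:
        yield from file

# build a throwaway file with 100,000 lines
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, "w", encoding="utf8") as f:
    f.writelines(f"line {i}\n" for i in range(100_000))

with open(path, encoding="utf8") as f:
    all_lines = f.readlines()   # the whole file is in memory at once

gen = lazy_reader(path)         # will hold only one line at a time

print(sys.getsizeof(all_lines)) # hundreds of KB for the list alone, before counting the strings
print(sys.getsizeof(gen))       # small and constant, whatever the file size

os.remove(path)
```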