Generator comprehension with open function

I'm trying to figure out the best way to use generators when parsing a file line by line. Which use of a generator comprehension would be better?

The first option:

with open('some_file') as file:
    lines = (line for line in file)

The second option:

lines = (line for line in open('some_file'))

I know they produce the same result, but which one is faster / more efficient?

You can't combine a generator with a context manager (a with statement) like this.

Generators are lazy. They don't actually read their source data until someone asks them for items.
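
A minimal sketch of that laziness (using a plain list instead of a file, purely for illustration):

gen = (x * 2 for x in [1, 2, 3])  # nothing has been computed yet
print(next(gen))                  # 2 -- the first item is produced only when asked for
print(list(gen))                  # [4, 6] -- the rest is produced on demand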

So your first option seems to work:

with open('some_file') as file:
    lines = (line for line in file)

but when you later actually try to read a line in your program:

for line in lines:
    print(line)

it will fail with ValueError: I/O operation on closed file.

That is because the context manager has already closed the file (which is its sole purpose in life), and the generator had not started reading it before the for loop began actually requesting data.
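
You can confirm both halves of that in one snippet (reusing the placeholder file name from the question):

with open('some_file') as file:
    lines = (line for line in file)

print(file.closed)  # True -- the with block closed the file on exit
next(lines)         # ValueError: I/O operation on closed file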

Your second suggestion

lines = (line for line in open('some_file'))

has the opposite problem. You open() the file, but unless you close() it manually (which you can't, because you have no reference to the file handle), it stays open indefinitely. That is exactly the problem context managers solve.

Generally speaking, if you want to read the file, you can just ... read the file:

with open('some_file') as file:
    lines = list(file)

Or you can use a real generator:

def lazy_reader(*args, **kwargs):
    with open(*args, **kwargs) as file:
        yield from file

and then you can do:

for line in lazy_reader('some_file', encoding="utf8"):
    print(line)

lazy_reader() will close the file once the last line has been read.
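
If you stop reading early, you can also release the file by closing the generator itself; a small sketch of that, using the same lazy_reader() definition:

gen = lazy_reader('some_file', encoding="utf8")
first_line = next(gen)  # the file is only opened here, on the first request
gen.close()             # raises GeneratorExit inside lazy_reader, so the with block closes the file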

If you want to benchmark something like this, I recommend looking at the timeit module.

Let's set up working versions of your two approaches, and I'll add a few more options that do the same job.

Here are several options:

def test1(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return [line for line in file_in]

def test2(file_path):
    return [line for line in open(file_path, "r", encoding="utf-8")]

def test3(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return file_in.readlines()

def test4(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return list(file_in)

def test5(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        yield from file_in

Let's test them with a text file that is ten copies of the complete works of Shakespeare, which I happen to keep around for exactly this kind of test.
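
If you don't have such a file handy, one way to build an equivalent one is to write ten copies of a smaller text back to back (shakespeare.txt here is just an assumed source file):

with open("shakespeare.txt", "r", encoding="utf-8") as src:
    text = src.read()

with open("shakespeare2.txt", "w", encoding="utf-8") as dst:
    dst.write(text * 10)  # ten copies, concatenated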

If I run:

print(test1('shakespeare2.txt') == test2('shakespeare2.txt'))
print(test1('shakespeare2.txt') == test3('shakespeare2.txt'))
print(test1('shakespeare2.txt') == test4('shakespeare2.txt'))
print(test1('shakespeare2.txt') == list(test5('shakespeare2.txt')))

I can see that all the tests produce the same result.

Now let's time them:

import timeit

setup = '''
file_path = "shakespeare2.txt"

def test1(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return [line for line in file_in]

def test2(file_path):
    return [line for line in open(file_path, "r", encoding="utf-8")]

def test3(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return file_in.readlines()

def test4(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return list(file_in)

def test5(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        yield from file_in
'''

print(timeit.timeit("test1(file_path)", setup=setup, number=100))
print(timeit.timeit("test2(file_path)", setup=setup, number=100))
print(timeit.timeit("test3(file_path)", setup=setup, number=100))
print(timeit.timeit("test4(file_path)", setup=setup, number=100))
print(timeit.timeit("list(test5(file_path))", setup=setup, number=100))

On my laptop this prints:

9.65
9.79
9.29
9.08
9.85

To me this says that, performance-wise, it doesn't really matter which one you pick. So don't use your test2() strategy :-)

Note, however, that test5() (credit to @tomalak) can matter a lot from a memory-management point of view!
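
For example, a sketch of the streaming use case that only test5() supports, where one line at a time is held in memory instead of the whole list:

longest = 0
for line in test5("shakespeare2.txt"):  # lines are yielded one by one, never collected into a list
    longest = max(longest, len(line))
print(longest)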