Generator comprehension with open function
I'm trying to figure out the best way to use generators when parsing a file line by line.
Which use of a generator comprehension would be better?
Option one:
with open('some_file') as file:
    lines = (line for line in file)
Option two:
lines = (line for line in open('some_file'))
I know they produce the same result, but which one is faster / more efficient?
You can't combine a generator expression and a context manager (a with statement) like that.
Generators are lazy. They don't actually read their source data until someone requests an item from them.
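As a quick illustration (a minimal sketch of my own, not part of the original question), a generator with a visible side effect shows that nothing runs until an item is requested:

```python
def noisy():
    # nothing in this body executes until next() is called on the generator
    print("producing 1")
    yield 1
    print("producing 2")
    yield 2

gen = noisy()       # creates the generator; prints nothing yet
value = next(gen)   # only now does "producing 1" print; value is 1
```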
This appears to work:
with open('some_file') as file:
    lines = (line for line in file)
But when you actually try to read a line later in your program:
for line in lines:
    print(line)
it will fail with ValueError: I/O operation on closed file.
That's because the context manager has already closed the file - that's its whole purpose in life - and the generator hadn't started reading from it before the for loop began actually requesting data.
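You can reproduce the failure without touching the filesystem; here is a minimal sketch using io.StringIO as a stand-in for the real file object:

```python
import io

buf = io.StringIO("first line\nsecond line\n")
lines = (line for line in buf)  # lazy: no data has been read yet
buf.close()                     # stands in for the with block ending

try:
    next(lines)                 # the first actual read attempt
except ValueError as err:
    print(err)                  # I/O operation on closed file
```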
Your second suggestion
lines = (line for line in open('some_file'))
has the opposite problem. You open() the file, but unless you close() it manually (which you can't, because you don't have a reference to the file handle), it will stay open forever. That is exactly the problem context managers solve.
In general, if you want to read the file, you can just... read the file:
with open('some_file') as file:
    lines = list(file)
Or you can use a real generator:
def lazy_reader(*args, **kwargs):
    with open(*args, **kwargs) as file:
        yield from file
Then you can do:
for line in lazy_reader('some_file', encoding="utf8"):
    print(line)
and lazy_reader() will close the file once the last line has been read.
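One caveat worth adding (my note, not from the original answer): if you stop iterating early, the file stays open while the generator sits suspended; calling close() on the generator (or letting it be garbage-collected) runs the with block's exit and closes the file. A small sketch using a throwaway temp file:

```python
import os
import tempfile

def lazy_reader(*args, **kwargs):
    with open(*args, **kwargs) as file:
        yield from file

# create a small throwaway file to read from
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, "w", encoding="utf8") as f:
    f.write("first\nsecond\nthird\n")

reader = lazy_reader(path, encoding="utf8")
first = next(reader)  # generator is now suspended inside its with block
reader.close()        # raises GeneratorExit at the yield; the with block closes the file

os.remove(path)
```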
If you want to benchmark things like this, I suggest you look at the timeit module.
Let's set up working versions of your two tests, and I'll add a few extra options that perform equivalently.
Here are several options:
def test1(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return [line for line in file_in]

def test2(file_path):
    return [line for line in open(file_path, "r", encoding="utf-8")]

def test3(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return file_in.readlines()

def test4(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return list(file_in)

def test5(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        yield from file_in
Let's test them with a text file containing 10 copies of the complete works of Shakespeare, which I happen to have around for exactly this kind of test.
If I do:
print(test1('shakespeare2.txt') == test2('shakespeare2.txt'))
print(test1('shakespeare2.txt') == test3('shakespeare2.txt'))
print(test1('shakespeare2.txt') == test4('shakespeare2.txt'))
print(test1('shakespeare2.txt') == list(test5('shakespeare2.txt')))
I see that all the tests produce identical results.
Now let's time them:
import timeit

setup = '''
file_path = "shakespeare2.txt"

def test1(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return [line for line in file_in]

def test2(file_path):
    return [line for line in open(file_path, "r", encoding="utf-8")]

def test3(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return file_in.readlines()

def test4(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return list(file_in)

def test5(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        yield from file_in
'''
print(timeit.timeit("test1(file_path)", setup=setup, number=100))
print(timeit.timeit("test2(file_path)", setup=setup, number=100))
print(timeit.timeit("test3(file_path)", setup=setup, number=100))
print(timeit.timeit("test4(file_path)", setup=setup, number=100))
print(timeit.timeit("list(test5(file_path))", setup=setup, number=100))
On my laptop this shows:
9.65
9.79
9.29
9.08
9.85
To me, that says the choice doesn't matter from a performance standpoint. So don't use your test2() strategy :-)
Note that test5() (credit to @tomalak) can matter a lot from a memory-management standpoint, though!
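To see why, compare how much memory the two approaches hold at once. This is a rough sketch of my own using sys.getsizeof on a generated throwaway file (exact numbers will vary by machine and Python version):

```python
import os
import sys
import tempfile

def lazy_reader(path):
    with open(path, encoding="utf8") as file:
        yield from file

# build a throwaway file with 100,000 lines
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, "w", encoding="utf8") as f:
    f.writelines(f"line {i}\n" for i in range(100_000))

with open(path, encoding="utf8") as f:
    all_lines = f.readlines()   # the whole file is in memory at once

gen = lazy_reader(path)         # will hold only one line at a time

print(sys.getsizeof(all_lines)) # hundreds of KB for the list alone, before counting the strings
print(sys.getsizeof(gen))       # small and constant, whatever the file size

os.remove(path)
```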