How to avoid creating unnecessary lists?
I often find myself in this situation: I pull some information out of a file (or wherever), and then have to massage the data through several steps into the final form I need. For example:
def insight_pull(file):
    with open(file) as in_f:
        lines = in_f.readlines()
        dirty = [line.split(' ') for line in lines]
        clean = [i[1] for i in dirty]
        cleaner = [[clean[i], clean[i + 1]] for i in range(0, len(clean), 2)]
        cleanest = [i[0].split() + i[1].split() for i in cleaner]
    with open("Output_File.txt", "w") as out_f:
        out_f.writelines(' '.join(i) + '\n' for i in cleanest)
Walking through the example above:
# Pull raw data from file splitting on ' '.
dirty = [line.split(' ') for line in lines]
# Select every 2nd element from each nested list.
clean = [i[1] for i in dirty]
# Couple every 2nd element with its predecessor into a new list.
cleaner = [[clean[i],clean[i + 1]] for i in range(0, len(clean),2)]
# Split each entry in cleaner into the final formatted list.
cleanest = [i[0].split() + i[1].split() for i in cleaner]
Given that I can't do all the edits in a single line or a single loop (because each edit depends on the one before it), is there a better way to structure code like this?
Apologies if the question is a bit vague. Any input is greatly appreciated.
Based on your example, I'm assuming only the cleanest list has any real value to you and the rest are just intermediate steps that can be thrown away.
Assuming that's the case, why not reuse the same variable for each intermediate step, so you're not holding several lists in memory at once?
def insight_pull(file):
    with open(file) as in_f:
        my_list = in_f.readlines()
        my_list = [line.split(' ') for line in my_list]
        my_list = [i[1] for i in my_list]
        my_list = [[my_list[i], my_list[i + 1]] for i in range(0, len(my_list), 2)]
        my_list = [i[0].split() + i[1].split() for i in my_list]
    with open("Output_File.txt", "w") as out_f:
        out_f.writelines(' '.join(i) + '\n' for i in my_list)
If performance is what you have in mind, then generators are what you're looking for. Generators are a lot like lists, but they are lazily evaluated, meaning each element is only produced when it's needed. In the sequence below, for instance, I never actually create three full lists, and each element is only computed once. The following is just an example of using generators (as I understand it, your code is an example of the kind of problem you run into, not the specific problem):
# All even values from 2-18
even = (i*2 for i in range(1, 10))
# Only those divisible by 3
multiples_of_3 = (val for val in even if val % 3 == 0)
# And finally, we want to evaluate the remaining values as hex
hexes = [hex(val) for val in multiples_of_3]
# output: ['0x6', '0xc', '0x12']
The first two expressions are generators; the last one is just a list comprehension. When there are many steps this saves a lot of memory, because you never build the intermediate lists. Note that generators cannot be indexed, and they can only be evaluated once (they are just streams of values).
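A minimal sketch of both limitations (the names here are purely illustrative):
squares = (n * n for n in range(5))

# Generators do not support indexing:
# squares[0]              # TypeError: 'generator' object is not subscriptable

first_pass = list(squares)    # consumes the stream -> [0, 1, 4, 9, 16]
second_pass = list(squares)   # already exhausted   -> []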
To that end I'd recommend pipeline processing. I found an article explaining the technique: generator pipelines.
Here is my attempt at a direct translation of your loops into a pipeline. The code is untested (as we have no data to test with) and may contain errors.
The leading f in the function names stands for filter.
def fromfile(name):
    # see comments
    with open(name) as in_f:
        for line in in_f:
            yield line

def fsplit(pp):
    for line in pp:
        yield line.split(' ')

def fitem1(pp):
    for item in pp:
        yield item[1]

def fpairs(pp):
    # edited
    for x in pp:
        try:
            yield [x, next(pp)]
        except StopIteration:
            break

def fcleanup(pp):
    for i in pp:
        yield i[0].split() + i[1].split()

pipeline = fcleanup(fpairs(fitem1(fsplit(fromfile(NAME)))))
output = list(pipeline)
For real-world use I would merge the first 3 filters into one, and likewise the last 2.
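One way that merging could look, as a sketch under the same assumptions as the pipeline above (the names fread_fields and fpair_and_clean are mine, and NAME is still the input file path):
def fread_fields(name):
    # fromfile + fsplit + fitem1 rolled into one: yield the second
    # space-separated field of every line in the file
    with open(name) as in_f:
        for line in in_f:
            yield line.split(' ')[1]

def fpair_and_clean(pp):
    # fpairs + fcleanup rolled into one: pair consecutive items,
    # then re-split each pair into the final flat list
    for x in pp:
        try:
            y = next(pp)
        except StopIteration:
            break
        yield x.split() + y.split()

pipeline = fpair_and_clean(fread_fields(NAME))
output = list(pipeline)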
Generator expressions
You're right that you don't want to build multiple lists. Each of your list comprehensions creates a brand-new list, wasting memory, and you end up iterating over every one of them!
@VPfB's idea of using generators is a good solution if there is somewhere else in your code where you would reuse the generator. If you don't need to reuse it, use generator expressions.
Generator expressions are lazy, just like generators, so when they're chained together, as here, the whole loop is evaluated once at the end, when writelines is called.
def insight_pull(file):
    with open(file) as in_f:
        dirty = (line.split(' ') for line in in_f)        # Combine with next
        clean = (i[1] for i in dirty)
        cleaner = (pair for pair in zip(clean, clean))    # Redundantly silly
        cleanest = (i[0].split() + i[1].split() for i in cleaner)

        # Don't build a single (possibly huge) string with join
        with open("Output_File.txt", "w") as out_f:
            out_f.writelines(' '.join(i) + '\n' for i in cleanest)
The above maps directly onto your question; you can take it a step further:
def insight_pull(file):
    with open(file) as in_f:
        clean = (line.split(' ')[1] for line in in_f)
        cleaner = zip(clean, clean)
        cleanest = (i[0].split() + i[1].split() for i in cleaner)

        with open("Output_File.txt", "w") as out_f:
            for line in cleanest:
                out_f.write(' '.join(line) + '\n')
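The zip(clean, clean) trick works because both arguments are the same iterator object, so zip pulls consecutive items off it in pairs. A tiny illustration with made-up data:
# Pairing consecutive items by handing the same iterator to zip twice.
it = iter(['a', 'b', 'c', 'd'])
pairs = list(zip(it, it))   # -> [('a', 'b'), ('c', 'd')]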