Avoid repeated loading of a file in functions
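I am trying to write a file of functions that take a fasta file and (i) give an overview of the file and (ii) plot a histogram of the sequence length distribution. I managed to write the following working code: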
from Bio import SeqIO
from prettytable import PrettyTable
import pylab
import numpy as np
import sys
%matplotlib inline

def fasta_outlook(fasta_file):
    '''Summarize the fasta file with #ofseq, length(min,max). Takes filename as string'''
=>  sizes = [len(rec) for rec in SeqIO.parse(fasta_file,"fasta")]
    table = PrettyTable(['Parameter', 'Stats'])
    table.add_row(['No. of Sequences', len(sizes)])
    table.add_row(['Shortest seq.length', min(sizes)])
    table.add_row(['Longest Seq.length', max(sizes)])
    print(table)

def fasta_burst(fasta_file):
    '''Reports the length of each fasta sequence in the file. Takes filename as string'''
    my_file = open("Seq_length.tab","w")
=>  for rec in SeqIO.parse(fasta_file,"fasta"):
        my_file.write(rec.id+'\t'+str(len(rec))+'\n')
    print("Length report written in Seq_length.tab")

def fasta_lendist(fasta_file):
    '''plot the distribution of sequence length as histogram. Takes filename as string'''
=>  sizes = [len(rec) for rec in SeqIO.parse(fasta_file,"fasta")]
    count,bins,_ = pylab.hist(sizes, bins=100, log=True, histtype='step',color='red')
    pylab.title("%i seq with len: %i to %i bp (range)\nBin Max: %i seq around %i bp"%(len(sizes),min(sizes),max(sizes),count.max(),bins[np.argmax(count)]))
    pylab.xlabel("Sequence length (bp)")
    pylab.ylabel("Log Count")
    pylab.savefig("Sequence_length_distribution_plot.png")
    print("Plot saved as Sequence_length_distribution_plot.png")

fasta = 'filename.fa'
fasta_outlook(fasta)
fasta_lendist(fasta)
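The problem here is that I load the file again in every function (=>). Is it possible to load the file once globally and use that object in the subsequent functions? Can the functions take the object as an argument instead of the filename (string)?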
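Read all the records once, then pass them to your functions. If your FASTA file is very large, this may be a very bad idea, since every record is held in memory. At the bottom of the script: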
fasta = 'filename.fa'
records = [record for record in SeqIO.parse(fasta,"fasta")]
fasta_outlook(records)
fasta_lendist(records)
One of your functions would now look like this:
def fasta_outlook(fasta_records):
    sizes = [len(rec) for rec in fasta_records]
    table = PrettyTable(['Parameter', 'Stats'])
    table.add_row(['No. of Sequences', len(sizes)])
    table.add_row(['Shortest seq.length', min(sizes)])
    table.add_row(['Longest Seq.length', max(sizes)])
    print(table)
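If the file is too large to hold every record in memory, one option is to stream the lengths through a single pass and keep only running totals. The sketch below is not part of the original answer: `summarize_lengths` is a hypothetical helper, and it accepts any iterable of integers so it can be tried without a FASTA file.

```python
def summarize_lengths(lengths):
    """Return (count, shortest, longest) for an iterable of lengths in one pass.

    Hypothetical helper: nothing is stored, so a generator can be passed in.
    """
    count = 0
    shortest = None
    longest = None
    for n in lengths:
        count += 1
        if shortest is None or n < shortest:
            shortest = n
        if longest is None or n > longest:
            longest = n
    return count, shortest, longest


# With Biopython this could be fed lazily, for example:
#   summarize_lengths(len(rec) for rec in SeqIO.parse("filename.fa", "fasta"))
print(summarize_lengths(iter([120, 45, 300])))
```

The histogram still needs every length, but a list of integers is much smaller than a list of full record objects.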
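You appear to use only the record lengths and IDs from the file. You can load those into a list of tuples or into two separate lists and pass them around. There is certainly no reason to parse your file over and over.
First write a function that loads the relevant data. I think a pair of lists is better, since you only use the IDs once: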
def load_file(filename):
    data = [(rec.id, len(rec)) for rec in SeqIO.parse(filename, "fasta")]
    # Transpose the data into two lists instead of list of pairs
    return tuple(map(list, zip(*data)))
Now your function calls should look like this:
fasta = 'filename.fa'
ids, sizes = load_file(fasta)
fasta_outlook(sizes)
fasta_lendist(sizes)
fasta_burst(ids, sizes)
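The transposition idiom used in load_file can be tried on its own. This is a minimal sketch with made-up (id, length) pairs standing in for parsed records; the sample values are assumptions for illustration only.

```python
# Made-up (id, length) pairs standing in for parsed FASTA records.
data = [("seq1", 120), ("seq2", 45), ("seq3", 300)]

# zip(*data) regroups the pairs column-wise; map(list, ...) turns each
# column tuple into a list.
ids, sizes = tuple(map(list, zip(*data)))
print(ids)
print(sizes)

# With no records, zip(*[]) yields nothing, so there is nothing to unpack.
empty = tuple(map(list, zip(*[])))
print(empty)
```

Because the empty case produces an empty tuple, unpacking the result into ids and sizes would raise a ValueError for a file with no records, so a guard may be worth adding if that can happen.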
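In fasta_outlook and fasta_lendist, you only need to change the input parameter name to sizes and remove the comprehension that computes those values. In fasta_burst, you can simplify the loop a little: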
def fasta_burst(ids, sizes):
    '''Reports the length of each fasta sequence. Takes parallel lists of ids and sizes'''
    with open("Seq_length.tab","w") as my_file:
        # seq_id avoids shadowing the built-in id()
        for seq_id, size in zip(ids, sizes):
            my_file.write('{}\t{}\n'.format(seq_id, size))
    print("Length report written in Seq_length.tab")
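Using a with block ensures that your file is closed when you are done. You were not closing it at all before, and with has the advantage of closing the file even if an error occurs.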
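The closing behaviour can be checked with a small experiment. This sketch writes to a temporary directory (an assumption made so the example does not touch the working directory) and simulates a failure partway through the write.

```python
import os
import tempfile

# Write into a throwaway directory rather than the current one.
path = os.path.join(tempfile.mkdtemp(), "Seq_length.tab")

try:
    with open(path, "w") as my_file:
        my_file.write("seq1\t120\n")
        raise RuntimeError("simulated failure while writing")
except RuntimeError:
    pass

# The with block closed the file on the way out, despite the exception.
print(my_file.closed)
```

With a bare open() call and no close(), the same failure would leave the file open until the object happened to be cleaned up.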