How to speed up a (fasta) subsampling program in Python?
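I wrote a small script to subsample x sequences from an original file. The original file is a fasta file with two lines per sequence, and the program extracts x of those sequences (the two lines together).
This is what it looks like: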
#!/usr/bin/env python3
import random
import sys

# How many random sequences do you want?
num = int(input("Enter number of random sequences to select:\n"))

# Import arguments
infile = open(sys.argv[1], "r")
outfile = open(sys.argv[2], "w")

# Define lists
fNames = []
fSeqs = []

# Extract fasta file into the two lists
for line in infile:
    if line.startswith(">"):
        fNames.append(line.rstrip())
    else:
        fSeqs.append(line.rstrip())

# Print total number of sequences in the original file
print("There are "+str(len(fNames))+" in the input file")

# Take random items out of the list for the total number of samples required
for j in range(num):
    a = random.randint(0, (len(fNames)-1))
    print(fNames.pop(a), file = outfile)
    print(fSeqs.pop(a), file = outfile)

infile.close()
outfile.close()
input("Done.")
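Building the lists of IDs and nucleotides (line 1 and line 2 of each sequence, respectively) is very fast, but printing them out takes a very long time. The number of sequences to extract can be as high as 2M, but it already gets slow from 10000.
I was wondering whether there is any way to make it faster. Is .pop the problem? Would it be faster if I first created a random list of unique numbers and then extracted those?
Finally, the terminal does not return to its "normal finished state" after printing Done., and I don't know why. With all my other scripts I can type again immediately after they finish.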
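On the .pop question: list.pop(i) at an arbitrary index has to shift every element after i, so each call costs time proportional to the list length, and the loop above does that twice per sampled sequence. Below is a minimal sketch of the "unique random numbers first" idea. It assumes the two lists are already filled and aligned by position. The function name sample_records and its seed parameter are hypothetical, chosen only for illustration.

```python
import random


def sample_records(names, seqs, k, seed=None):
    """Return k distinct (name, seq) pairs without modifying the input lists."""
    rng = random.Random(seed)
    # random.sample on a range yields k unique indices, so nothing is popped
    picked = rng.sample(range(len(names)), k)
    # Indexing a list is constant time, unlike pop() at an arbitrary position
    return [(names[i], seqs[i]) for i in picked]


# Tiny made-up data set, only to show the call
names = [">seq0", ">seq1", ">seq2", ">seq3", ">seq4"]
seqs = ["ACGT", "GGCC", "TTAA", "CCGG", "ATAT"]

for name, seq in sample_records(names, seqs, 3, seed=0):
    print(name)
    print(seq)
```

random.sample raises ValueError if k is larger than the number of sequences, so asking for more records than the file holds fails loudly instead of looping forever.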
random.sample (suggested in the comments) and a dictionary made the script faster.
Here is the final script:
#!/usr/bin/env python3
import random
import sys

# How many random sequences do you want?
num = int(input("Enter number of random sequences to select:\n"))

# Import arguments
infile = open(sys.argv[1], "r")
outfile = open(sys.argv[2], "w")

# Define list and dictionary
fNames = []
dicfasta = {}

# Extract fasta file into the list (IDs) and the dictionary (ID -> sequence)
for line in infile:
    if line.startswith(">"):
        fNames.append(line.rstrip())
        Id = line.rstrip()
    else:
        dicfasta[Id] = line.rstrip()

# Print total number of sequences in the original file
print("There are "+str(len(fNames))+" in the input file")

# Create subsample: num unique IDs chosen at random
subsample = random.sample(fNames, num)

# Write each sampled ID and its sequence to the output file
for j in subsample:
    print(j, file = outfile)
    print(dicfasta[j], file = outfile)

infile.close()
outfile.close()
input("Done.")
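About the terminal not returning: input("Done.") prints the text and then waits for Enter, so the prompt only comes back after a key press. Replacing it with print("Done.") should avoid that. Below is a rough sketch of the same random.sample approach wrapped in a function, under a few assumptions that are mine and not part of the script above: the sample size is passed as a third command-line argument instead of being prompted for, records are kept as (header, sequence) tuples so that duplicate headers do not overwrite each other, and the name subsample_fasta is hypothetical.

```python
import random
import sys


def subsample_fasta(in_path, out_path, num):
    """Write num randomly chosen two-line records from in_path to out_path.

    Returns the total number of records found in the input file.
    """
    records = []
    with open(in_path) as infile:
        header = None
        for line in infile:
            line = line.rstrip()
            if line.startswith(">"):
                header = line
            elif header is not None:
                # Pair the sequence line with the header that preceded it
                records.append((header, line))
                header = None

    # random.sample picks num distinct records in one call
    chosen = random.sample(records, num)

    with open(out_path, "w") as outfile:
        for header, seq in chosen:
            outfile.write(f"{header}\n{seq}\n")
    return len(records)


# Assumed usage: script.py input.fasta output.fasta 10000
if __name__ == "__main__" and len(sys.argv) == 4:
    total = subsample_fasta(sys.argv[1], sys.argv[2], int(sys.argv[3]))
    print(f"There are {total} sequences in the input file")
    print("Done.")
```

The with blocks close both files even if random.sample raises because num is larger than the number of records.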