如果 headers 包含某些字符串,如何基于 headers 对 fasta 文件进行子采样?
how to subsample a fasta file based on the headers if headers contain certain strings?
我有一个这样的 fasta 文件:
>gi|373248686|emb|HE586118.1| Streptomyces albus subsp. albus salinomycin biosynthesis cluster, strain DSM 41398
GGATGCGAAGGACGCGCTGCGCAAGGCGCTGTCGATGGGTGCGGACAAGGGCATCCACGT
CGAGGACGACGATCTGCACGGCACCGACGCCGTGGGTACCTCGCTGGTGCTGGCCAAGGC
>gi|1139489917|gb|KX622588.1| Hyalangium minutum strain DSM 14724 myxochromide D subtype 1 biosynthetic gene cluster and tRNA-Thr gene, complete sequence
ATGCGCAAGCTCGTCATCACGGTGGGGATTCTGGTGGGGTTGGGGCTCGTGGTCCTTTGG
TTCTGGAGCCCGGGAGGCCCAGTCCCCTCCACGGACACGGAGGGGGAAGGGCGGAGTCAG
CGCCGGCAGGCCATGGCCCGGCCCGGCTCCGCGCAGCTGGAGAGTCCCGAGGACATGGGG
>gi|930076459|gb|KR364704.1| Streptomyces sioyaensis strain BCCO10_981 putative annimycin-type biosynthetic gene cluster, partial sequence
GCCGGCAGGTGGGCCGCGGTCAGCTTCAGGACCGTGGCCGTCGCGCCCGCCAGCACCACG
GAGGCCCCCACGGCCAGCGCCGGGCCCGTGCCCGTGCCGTACGCGAGGTCCGTGCTGAAC
我有一个包含数字列表的文本文件:
373248686
930076459
296280703
......
我想执行以下操作:
if the header in the fasta file contains the numbers in the text file:
write all the matches(header+sequence) to a new output.fasta file.
如何在 python 中执行此操作?这看起来很简单,只需一些 for 循环就可以完成这项工作,但不知何故我无法做到这一点,如果我的文件真的很大,在另一个循环中循环可能需要很长时间。这是我尝试过的:
from Bio import SeqIO
import sys
wanted = []
for line in open(sys.argv[2]):
titles = line.strip()
wanted.append(titles)
seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')
sys.stdout = open('output.fasta', 'w')
new_seq = []
for seq in seqiter:
new_seq.append(seq if i in seq.id for i in wanted)
SeqIO.write(new_seq, sys.stdout, "fasta")
sys.stdout.close()
收到此错误:
new_seq.append(seq if i in seq.id for i in wanted)
^
SyntaxError: invalid syntax
有更好的方法吗?
谢谢!
使用这样的程序
from Bio import SeqIO
import sys
# read in the text file
numbersInTxtFile = set()
# hint: use with, then you don't need to
# program file closing. Furhtermore error
# handling is comming along with this too
with open(sys.argv[2], "r") as inF:
for line in inF:
line = line.strip()
if line == "": continue
numbersInTxtFile.add(int(line))
# read in the fasta file
with open(sys.argv[1], "r") as inF:
for record in SeqIO.parse(inF, "fasta"):
# now check if this record in the fasta file
# has an id we are searching for
name = record.description
id = int(name.split("|")[1])
print(id, numbersInTxtFile, id in numbersInTxtFile)
if id in numbersInTxtFile:
# we need to output
print(">%s" % name)
print(record.seq)
然后您可以从命令行调用它
python3 nameOfProg.py inputFastaFile.fa idsToSearch.txt > outputFastaFile.fa
将您的 "keeper" ID 导入字典而不是列表,这样会快得多,因为不必搜索列表数千次。
keepers = {}
with open("ids.txt", "r") as id_handle:
for curr_id in id_handle:
keepers[curr_id] = True
列表理解生成一个列表,因此您不需要附加到另一个列表。
keeper_seqs = [x for x in seqiter if x.id in keepers]
对于较大的文件,您需要遍历 seqiter 并一次写入一个条目以避免内存问题。
你也不应该没有充分的理由分配给 sys.stdout
,如果你想输出到 STDOUT
只需使用 print
或 sys.stdout.write()
.
我有一个这样的 fasta 文件:
>gi|373248686|emb|HE586118.1| Streptomyces albus subsp. albus salinomycin biosynthesis cluster, strain DSM 41398
GGATGCGAAGGACGCGCTGCGCAAGGCGCTGTCGATGGGTGCGGACAAGGGCATCCACGT
CGAGGACGACGATCTGCACGGCACCGACGCCGTGGGTACCTCGCTGGTGCTGGCCAAGGC
>gi|1139489917|gb|KX622588.1| Hyalangium minutum strain DSM 14724 myxochromide D subtype 1 biosynthetic gene cluster and tRNA-Thr gene, complete sequence
ATGCGCAAGCTCGTCATCACGGTGGGGATTCTGGTGGGGTTGGGGCTCGTGGTCCTTTGG
TTCTGGAGCCCGGGAGGCCCAGTCCCCTCCACGGACACGGAGGGGGAAGGGCGGAGTCAG
CGCCGGCAGGCCATGGCCCGGCCCGGCTCCGCGCAGCTGGAGAGTCCCGAGGACATGGGG
>gi|930076459|gb|KR364704.1| Streptomyces sioyaensis strain BCCO10_981 putative annimycin-type biosynthetic gene cluster, partial sequence
GCCGGCAGGTGGGCCGCGGTCAGCTTCAGGACCGTGGCCGTCGCGCCCGCCAGCACCACG
GAGGCCCCCACGGCCAGCGCCGGGCCCGTGCCCGTGCCGTACGCGAGGTCCGTGCTGAAC
我有一个包含数字列表的文本文件:
373248686
930076459
296280703
......
我想执行以下操作:
if the header in the fasta file contains the numbers in the text file:
write all the matches(header+sequence) to a new output.fasta file.
如何在 python 中执行此操作?这看起来很简单,只需一些 for 循环就可以完成这项工作,但不知何故我无法做到这一点,如果我的文件真的很大,在另一个循环中循环可能需要很长时间。这是我尝试过的:
from Bio import SeqIO
import sys
wanted = []
for line in open(sys.argv[2]):
titles = line.strip()
wanted.append(titles)
seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')
sys.stdout = open('output.fasta', 'w')
new_seq = []
for seq in seqiter:
new_seq.append(seq if i in seq.id for i in wanted)
SeqIO.write(new_seq, sys.stdout, "fasta")
sys.stdout.close()
收到此错误:
new_seq.append(seq if i in seq.id for i in wanted)
^
SyntaxError: invalid syntax
有更好的方法吗?
谢谢!
使用这样的程序
from Bio import SeqIO
import sys
# read in the text file
numbersInTxtFile = set()
# hint: use with, then you don't need to
# program file closing. Furhtermore error
# handling is comming along with this too
with open(sys.argv[2], "r") as inF:
for line in inF:
line = line.strip()
if line == "": continue
numbersInTxtFile.add(int(line))
# read in the fasta file
with open(sys.argv[1], "r") as inF:
for record in SeqIO.parse(inF, "fasta"):
# now check if this record in the fasta file
# has an id we are searching for
name = record.description
id = int(name.split("|")[1])
print(id, numbersInTxtFile, id in numbersInTxtFile)
if id in numbersInTxtFile:
# we need to output
print(">%s" % name)
print(record.seq)
然后您可以从命令行调用它
python3 nameOfProg.py inputFastaFile.fa idsToSearch.txt > outputFastaFile.fa
将您的 "keeper" ID 导入字典而不是列表,这样会快得多,因为不必搜索列表数千次。
keepers = {}
with open("ids.txt", "r") as id_handle:
for curr_id in id_handle:
keepers[curr_id] = True
列表理解生成一个列表,因此您不需要附加到另一个列表。
keeper_seqs = [x for x in seqiter if x.id in keepers]
对于较大的文件,您需要遍历 seqiter 并一次写入一个条目以避免内存问题。
你也不应该没有充分的理由分配给 sys.stdout
,如果你想输出到 STDOUT
只需使用 print
或 sys.stdout.write()
.