Compare a FASTA file with a txt file full of sequence IDs
I need help because I'm stuck.
I have a txt file of sequence IDs that looks like this:
tr|K9RTD0|K9RTD0_SYNP3
tr|K9RSV3|K9RSV3_SYNP3
tr|K9RRE8|K9RRE8_SYNP3
tr|K9RMU9|K9RMU9_SYNP3
Then I have a typical FASTA file:
>sp|P00115|CYC6_SYNP3 Cytochrome c6 OS=Synechococcus sp. (strain ATCC 27167 / PCC 6312) OX=195253 GN=petJ PE=1 SV=2
MKTLLTILALTLVTLTTWLSTPAFAADIADGAKVFSANCAACHMGGGNVVMANKTLKKEA
LEQFGMNSADAIMYQVQNGKNAMPAFGGRLSEAQIENVAAYVLDQSSKNWAG
>tr|K9RTH7|K9RTH7_SYNP3 N-acyl-D-glucosamine 2-epimerase OS=Synechococcus sp. (strain ATCC 27167 / PCC 6312) OX=195253 GN=Syn6312_2130 PE=4 SV=1
MAPQINFPFSDLIAGYVTSYDTETDIFGLKTSDGREFPVKLSPMAYAKVIQNFDEGYPDA
TSTMRAWLTPGRFLFVYGVFYPDTDVFDAKQVVFAGKKEDDYVFEKQDWWIQQINALGKF
YVKAQFGQEEIDYRNYRTDLSVSGERSSVKFRQETDTISRLVYGFATAFMMTGDEVFLEA
AEKGTEYLRDHMRFVDRDEDIIYWYHGIDVQGEKELKIFASEFGDDYDAIPAYEQIYALA
GPIQTYRCTGDPRILSDAEQTIKLFDKFFLDQSEYGGYFSHIDPLMLDPRSDSLGRNKAR
KNWNSVGDHAPAYLINLWLATGEQKYADMLEYTFDTIEKYFPDYENSPFVQERFYEDWSH
DTTWGWQQNRAVVGHNLKIAWNLMRMQSLKPKEQYVGLAQKIADLMPSVGSDQQRGGWSD
TVERLLTNNSKFHQFVWHDRKAWWQQEQAILAYLILGGILEHDDYHRLGREAAAFYNAWF
LDLEDGGVYFNVLANGISYLARGNERAKGSHSMSGYHSFELCYLAAVYTNFLITKHPMDF
YFKPLPNGFPDRILRVSPDILPPGSILLESVEIDGKAYTDFDSQALTVKLPETKERVKVK
VRLAPKS
>tr|K9RXQ9|K9RXQ9_SYNP3 Uncharacterized protein OS=Synechococcus sp. (strain ATCC 27167 / PCC 6312) OX=195253 GN=Syn6312_3008 PE=4 SV=1
MKVEILKKRLNKECPMTTTRMPEDVIQELKQIASLLVFWGYQPLIGADIGQGLRTDLEQL
EDDKVSALVASLKRHRVSDEVLQTALMETTIN
I need to compare the two files, look up each sequence's description by its ID, and print it.
My code:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import sys

p = "proteome.fasta"
file = "reference.txt"
out = "jopik.txt"

with open(out, "w") as o:
    sys.stdout = o
    for seq_record in SeqIO.parse(open(p, mode = "r"),"fasta"):
        seq_record.description=' '.join(seq_record.description.split()[1:])
        with open(file,"r") as f:
            line = f.readlines()
            print(line)
            if (seq_record.id == line):
                i = seq_record.description
                print(i)
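You are just missing some kind of loop (for x in y:).
Also, file handles in Python are iterable (iteration is line by line in non-binary mode), which saves you from loading the whole file into memory before you start iterating (as .readlines() does).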
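To make the failure in the question concrete: .readlines() returns a list of strings, each still ending in a newline, so seq_record.id == line compares a string with a list and can never be true. A minimal sketch, where the in-memory io.StringIO object is only a stand-in for reference.txt and record_id is a hypothetical value of seq_record.id:

```python
import io

# Stand-in for reference.txt (hypothetical in-memory file, for illustration only)
reference = io.StringIO("tr|K9RTD0|K9RTD0_SYNP3\ntr|K9RSV3|K9RSV3_SYNP3\n")

lines = reference.readlines()          # a list; every item keeps its "\n"
record_id = "tr|K9RTD0|K9RTD0_SYNP3"   # what seq_record.id would look like

print(record_id == lines)              # False: a str never equals a list
print(record_id in lines)              # False: the items still end with "\n"

reference.seek(0)                      # rewind, then iterate line by line
wanted_ids = {line.strip() for line in reference if line.strip()}
print(record_id in wanted_ids)         # True
```

Stripping each line and collecting the IDs once, before looping over the FASTA records, also avoids reopening the ID file for every record.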
# load first file and create a helpful structure
compare_dict = {}
with open("reference.txt") as fh:
    for line in fh:
        if line.strip():  # throw out blank lines, could do a stricter compare
            compare_dict[line.strip()] = None

# form a tuple of possible prefixes
compare_tuple = tuple(">" + a for a in compare_dict.keys())

with open("proteome.fasta") as fh:
    for line_no, line in enumerate(fh, 1):  # lines start at 1, not 0
        if line.startswith(compare_tuple):
            # drop the trailing newline, then split ">ID" from the description
            key, value = line.rstrip("\n").split(" ", 1)
            key = key[1:]  # strip ">" from prefix
            compare_dict[key] = value
            print("found {} on L{}: {}".format(key, line_no, value))

# optionally display keys which were not in your .fasta file
for key, value in compare_dict.items():
    if value is None:
        print("failed to find a definition for {}".format(key))
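One caveat with prefix matching: startswith would also accept a header whose ID merely begins with a wanted ID. A possible variant, sketched here under assumptions, compares the full ID exactly instead. The function name find_descriptions is hypothetical, the in-memory files stand in for reference.txt and proteome.fasta, and the sample headers are shortened from the question's data:

```python
import io

def find_descriptions(id_lines, fasta_lines):
    """Map each wanted ID to its FASTA description (None if not found)."""
    found = {line.strip(): None for line in id_lines if line.strip()}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        # Header layout is ">ID description"; the description may be empty
        seq_id, _, description = line[1:].strip().partition(" ")
        if seq_id in found:  # exact match on the whole ID
            found[seq_id] = description
    return found

# Abbreviated sample data (descriptions shortened for the example)
ids = io.StringIO("tr|K9RTH7|K9RTH7_SYNP3\ntr|K9RTD0|K9RTD0_SYNP3\n")
fasta = io.StringIO(
    ">sp|P00115|CYC6_SYNP3 Cytochrome c6 OS=Synechococcus sp.\n"
    "MKTLLTILALTLVTLTTWLSTPAFAADIADGAKVFSANCAACHMGGGNVVMANKTLKKEA\n"
    ">tr|K9RTH7|K9RTH7_SYNP3 N-acyl-D-glucosamine 2-epimerase\n"
    "MAPQINFPFSDLIAGYVTSYDTETDIFGLKTSDGREFPVKLSPMAYAKVIQNFDEGYPDA\n"
)

result = find_descriptions(ids, fasta)
for seq_id, description in result.items():
    print(seq_id, "->", description)
```

The function accepts any iterable of lines, so open file handles can be passed in place of the StringIO objects. IDs that never appear in the FASTA input keep the value None, which mirrors the "failed to find" check above.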