处理 Blast 结果文件 Python
Manipulate Blast Result File Python
我写了一个 Biopython 脚本,它给我结果,我有一个这样的文件:
>NW_020169394.1_41 [10497-10619]|KE647364.1_346 [38084-37959]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
>NW_020169394.1_41 [10497-10619]|KE646921.1_20 [383-240]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
>NW_020169394.1_41 [10497-10619]|KE647277.1_227 [70875-70720]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
如何在单个评论行中获得结果:
>NW_020169394.1_41 [10497-10619]|KE647364.1_346 [38084-37959] | KE646921.1_20 [383-240] | KE647277.1_227 [70875-70720]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
我尝试使用正则表达式,但它不起作用。感谢您的回答。
okkey 没有使用 'SeqIO.to_dict()' 不知道,也许其他人会 post 使用 [已编辑
的答案
record_dict = SeqIO.to_dict(SeqIO.parse(fasta_file , "fasta"))
将筹集
raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key 'NW_020169394.1_41'
我的输入 'res.txt' 文件是:
>NW_020169394.1_41 [10497-10619]|KE647364.1_346 [38084-37959]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
>NW_020169394.1_41 [10497-10619]|KE646921.1_20 [383-240]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
>NW_020169394.1_41 [10497-10619]|KE647277.1_227 [70875-70720]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
我的代码是:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 28 15:57:59 2021
@author: Pietro
"""
from Bio import SeqIO
fasta_file = 'res.txt'
sequences = {}
sequences2 = {}
sequa = SeqIO.parse(fasta_file, "fasta")
count = 0
for seq_record in sequa :
sequences[count] = seq_record
count +=1
for key,seq_record in sequences.items():
countz = 0
if seq_record.seq not in [valuez.seq for keiz,valuez in sequences2.items()]:
sequences2[countz] = seq_record
countz += 1
else:
for keiz,valuez in sequences2.items():
if seq_record.seq == valuez.seq:
new = valuez.description.replace('valuez.id', '')
newnew = new.replace(valuez.name , '')
sequences2[keiz].description = seq_record.description + (newnew)
print('########################')
print(sequences2)
with open('modified_res.text', 'w') as handle:
for key,value in sequences2.items():
print(value)
SeqIO.write(value, handle, 'fasta')
我的输出'modified_res.text'文件是:
>NW_020169394.1_41 [10497-10619]|KE647277.1_227 [70875-70720] [10497-10619]|KE646921.1_20 [383-240] [10497-10619]|KE647364.1_346 [38084-37959]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
无法删除重复的 [10497-10619]
并且顺序与您的不同
抱歉,我已经尽力了
此外,如果您的序列可以通过特定的 seqrecord 格式解析(即 genbank 等,不是专家,可以用不同的方式使用它们的描述,请告诉我)。
在 FASTA 标签中,第一个 space 之前的所有内容都是 ID,应该是唯一的。在您的示例中情况并非如此,因此 SeqIO.to_dict()
将不起作用。相反,将序列映射回它们的标签,然后组合它们:
from Bio import SeqIO
from collections import defaultdict
seq2label = defaultdict(list)
for record in SeqIO.parse('result.fa', 'fasta'):
seq2label[str(record.seq)].append(record.description)
for sequence, labels in seq2label.items():
combined_label = ' | '.join(labels[:1] + [label.split('|')[1] for label in labels[1:]])
print(f'>{combined_label}\n{sequence}\n')
输出:
>NW_020169394.1_41 [10497-10619]|KE647364.1_346 [38084-37959] | KE646921.1_20 [383-240] | KE647277.1_227 [70875-70720]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
我写了一个 Biopython 脚本,它给我结果,我有一个这样的文件:
>NW_020169394.1_41 [10497-10619]|KE647364.1_346 [38084-37959]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
>NW_020169394.1_41 [10497-10619]|KE646921.1_20 [383-240]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
>NW_020169394.1_41 [10497-10619]|KE647277.1_227 [70875-70720]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
如何在单个评论行中获得结果:
>NW_020169394.1_41 [10497-10619]|KE647364.1_346 [38084-37959] | KE646921.1_20 [383-240] | KE647277.1_227 [70875-70720]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
我尝试使用正则表达式,但它不起作用。感谢您的回答。
okkey 没有使用 'SeqIO.to_dict()' 不知道,也许其他人会 post 使用 [已编辑
的答案record_dict = SeqIO.to_dict(SeqIO.parse(fasta_file , "fasta"))
将筹集
raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key 'NW_020169394.1_41'
我的输入 'res.txt' 文件是:
>NW_020169394.1_41 [10497-10619]|KE647364.1_346 [38084-37959]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
>NW_020169394.1_41 [10497-10619]|KE646921.1_20 [383-240]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
>NW_020169394.1_41 [10497-10619]|KE647277.1_227 [70875-70720]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
我的代码是:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 28 15:57:59 2021
@author: Pietro
"""
from Bio import SeqIO
fasta_file = 'res.txt'
sequences = {}
sequences2 = {}
sequa = SeqIO.parse(fasta_file, "fasta")
count = 0
for seq_record in sequa :
sequences[count] = seq_record
count +=1
for key,seq_record in sequences.items():
countz = 0
if seq_record.seq not in [valuez.seq for keiz,valuez in sequences2.items()]:
sequences2[countz] = seq_record
countz += 1
else:
for keiz,valuez in sequences2.items():
if seq_record.seq == valuez.seq:
new = valuez.description.replace('valuez.id', '')
newnew = new.replace(valuez.name , '')
sequences2[keiz].description = seq_record.description + (newnew)
print('########################')
print(sequences2)
with open('modified_res.text', 'w') as handle:
for key,value in sequences2.items():
print(value)
SeqIO.write(value, handle, 'fasta')
我的输出'modified_res.text'文件是:
>NW_020169394.1_41 [10497-10619]|KE647277.1_227 [70875-70720] [10497-10619]|KE646921.1_20 [383-240] [10497-10619]|KE647364.1_346 [38084-37959]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET
无法删除重复的 [10497-10619]
并且顺序与您的不同
抱歉,我已经尽力了
此外,如果您的序列可以通过特定的 seqrecord 格式解析(即 genbank 等,不是专家,可以用不同的方式使用它们的描述,请告诉我)。
在 FASTA 标签中,第一个 space 之前的所有内容都是 ID,应该是唯一的。在您的示例中情况并非如此,因此 SeqIO.to_dict()
将不起作用。相反,将序列映射回它们的标签,然后组合它们:
from Bio import SeqIO
from collections import defaultdict
seq2label = defaultdict(list)
for record in SeqIO.parse('result.fa', 'fasta'):
seq2label[str(record.seq)].append(record.description)
for sequence, labels in seq2label.items():
combined_label = ' | '.join(labels[:1] + [label.split('|')[1] for label in labels[1:]])
print(f'>{combined_label}\n{sequence}\n')
输出:
>NW_020169394.1_41 [10497-10619]|KE647364.1_346 [38084-37959] | KE646921.1_20 [383-240] | KE647277.1_227 [70875-70720]
MDQLSRKLNLTYLKVGILTSQNEFVTKHLLIIKGLKIFTET