Intersection of two FASTA files using Python
I have two large FASTA files. Their structure differs (as shown below), but the read headers (the lines starting with >) are the same in both files:
File 1
>MN00153:75:000H37WNG:1:12106:12990:1333
AAAACCCC
>MN00153:75:000H37WNG:1:12106:21652:2374
AAAAGGGG
>MN00153:75:000H37WNG:1:12106:21652:2366
AGGGGGTT
File 2
>MN00153:75:000H37WNG:1:12106:12990:1333
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCAGATCTCGCCC
>MN00153:75:000H37WNG:1:12106:21652:2374
AGATCTCGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
>MN00153:75:000H37WNG:1:12106:21652:2366
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
I used a script to build a dictionary from the headers (keys) and reads (values) of file1:
from Bio import SeqIO

# Map each read ID in file1 to its sequence as a plain string.
dict = {}
with open('index2.fasta', 'r') as file1:
    for record in SeqIO.parse(file1, 'fasta'):
        dict[str(record.id)] = str(record.seq)
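What I did next was iterate over the reads in file2 and, whenever the string 'AGATCTCG' occurs in a read, save that read's header in a list.
The problem I have now is that I want to make a sub-file of file2 based on the dictionary and the list. If an item in my list exists as a key in my dictionary and its value is 'AAAACCCC', the output should be MN00153:75:000H37WNG:1:12106:12990:1333, but I get both MN00153:75:000H37WNG:1:12106:12990:1333 and MN00153:75:000H37WNG:1:12106:21652:2374.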
ATTACTCG_ids = []
with open('Read1.fasta', 'r') as file2:
    for record in SeqIO.parse(file2, 'fasta'):
        if 'AGATCTCG' in record.seq:
            ATTACTCG_ids.append(record.id)
            for i in ATTACTCG_ids:
                if dict.get(i) == 'AAAACCCC':
                    final = record.format('fasta')
                    print(final)
Can someone help me solve this?
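The problem seems to be that the accumulated list of IDs is iterated over again for every record. Once the ID of an earlier matching read is in the list, the inner check succeeds for that old ID and the current record gets printed, even though its own index sequence was never tested. You can check each record individually instead. Something like this: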
with open('Read1.fasta', 'r') as file2:
    for record in SeqIO.parse(file2, 'fasta'):
        # Test the current record only: motif in the read and matching index in file1.
        if 'AGATCTCG' in record.seq and dict.get(record.id) == 'AAAACCCC':
            final = record.format('fasta')
            print(final)
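To see the per-record check work on the sample data from the question without needing the real files, here is a minimal sketch that uses only the standard library. The helper names read_fasta and select_reads are hypothetical, and read_fasta is only a simplified stand-in for SeqIO.parse; it assumes plain FASTA input where the ID is the first whitespace-separated token of the header line.

```python
import io

# Sample records copied from the question; real input would come from files.
FILE1 = """>MN00153:75:000H37WNG:1:12106:12990:1333
AAAACCCC
>MN00153:75:000H37WNG:1:12106:21652:2374
AAAAGGGG
>MN00153:75:000H37WNG:1:12106:21652:2366
AGGGGGTT
"""
FILE2 = """>MN00153:75:000H37WNG:1:12106:12990:1333
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCAGATCTCGCCC
>MN00153:75:000H37WNG:1:12106:21652:2374
AGATCTCGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
>MN00153:75:000H37WNG:1:12106:21652:2366
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
"""


def read_fasta(handle):
    """Yield (read_id, sequence) pairs. Hypothetical stand-in for SeqIO.parse."""
    read_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if read_id is not None:
                yield read_id, ''.join(chunks)
            # Keep only the first token of the header as the ID.
            read_id, chunks = (line[1:].split() or [''])[0], []
        else:
            chunks.append(line)
    if read_id is not None:
        yield read_id, ''.join(chunks)


def select_reads(index_handle, read_handle, index_seq, motif):
    """Return reads that contain motif and whose index sequence equals index_seq."""
    index_by_id = dict(read_fasta(index_handle))
    return [
        (read_id, seq)
        for read_id, seq in read_fasta(read_handle)
        # Both conditions are tested against the current record only.
        if motif in seq and index_by_id.get(read_id) == index_seq
    ]


selected = select_reads(io.StringIO(FILE1), io.StringIO(FILE2),
                        'AAAACCCC', 'AGATCTCG')
for read_id, seq in selected:
    print(f'>{read_id}\n{seq}')
```

On the sample data this selects only the read ending in 12990:1333. With Biopython, the matching records could be collected in a list and written to the sub-file with SeqIO.write(records, handle, 'fasta') instead of being printed.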