如何使用Python根据坐标信息检索FASTA序列?
How to retrieve FASTA sequences according to coordinate information using Python?
我想根据bed文件B.bed获取序列,其中包含序列的坐标信息,将坐标与fasta文件A.fasta进行匹配,并根据B.bed 文件。 Fasta 文件是带有序列 ID 的普通文件,后面跟着它的核苷酸 ATCG。但是我得到下面的关键错误。谁能帮我?
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from collections import defaultdict
# read names and postions from bed file
positions = defaultdict(list)
with open('B.bed') as f:
for line in f:
name, start, stop = line.split()
positions[name].append((int(start), int(stop)))
# parse faste file and turn into dictionary
records = SeqIO.to_dict(SeqIO.parse(open('A.fasta'), 'fasta'))
# search for short sequences
short_seq_records = []
for name in positions:
for (start, stop) in positions[name]:
long_seq_record = records[name]
long_seq = long_seq_record.seq
alphabet = long_seq.alphabet
short_seq = str(long_seq)[start-1:stop]
short_seq_record =SeqRecord(Seq(name))+ '\t'+str(start)+'\t'+str(stop)+'\t' + SeqRecord(Seq(short_seq, alphabet))
short_seq_records.append(short_seq_record)
# write to file
with open('C.fasta', 'w') as f:
SeqIO.write(short_seq_records,f, 'fasta')
A.fasta
>chr16:13361561-13361573
TAGTGGGTCAGAC
>chr6_apd_hap1:2165669-2165681
AGATGAGTCATCA
>chr10:112612173-112612185
AAGTGTGTCAGCT
B.bed
chr6_apd_hap1 2165668 2165681
chr10 112612172 112612185
C.fasta
的预期输出
>chr6_apd_hap1:2165669-2165681
AGATGAGTCATCA
>chr10:112612173-112612185
AAGTGTGTCAGCT
但我收到以下错误:
long_seq_record = records[name]
KeyError: 'chr6_apd_hap1'
在您的代码中,positions
是一个 defaultdict
,它具有 BED 文件中的名称作为键:
>>> print positions.keys()
['chr10', 'chr6_apd_hap1']
和records
是一个字典,以FASTA文件的headers为键,减去开头的>
,但它们仍然包括冒号和位置染色体:
>>> print records.keys()
['chr16:13361561-13361573', 'chr6_apd_hap1:2165669-2165681', 'chr10:112612173-112612185']
所以你首先需要转换你的 records
键来释放额外的信息,这样你就可以使用 positions
键来检索它们。您可以通过在创建 records
字典后添加以下行来执行此操作:
records = {key.split(':')[0]: value for (key, value) in records.iteritems()}
此外,您目前构建 short_seq_record
的方式实际上并不奏效。替换行
short_seq = str(long_seq)[start-1:stop]
short_seq_record =SeqRecord(Seq(name))+ '\t'+str(start)+'\t'+str(stop)+'\t' + SeqRecord(Seq(short_seq, alphabet))
与:
short_seq = Seq(str(long_seq)[start-1:stop], alphabet)
short_seq_id = '{0}\t{1}\t{2}'.format(name, start, stop)
short_seq_record = SeqRecord(short_seq, id=short_seq_id, description='')
我想根据bed文件B.bed获取序列,其中包含序列的坐标信息,将坐标与fasta文件A.fasta进行匹配,并根据B.bed 文件。 Fasta 文件是带有序列 ID 的普通文件,后面跟着它的核苷酸 ATCG。但是我得到下面的关键错误。谁能帮我?
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from collections import defaultdict
# read names and postions from bed file
positions = defaultdict(list)
with open('B.bed') as f:
for line in f:
name, start, stop = line.split()
positions[name].append((int(start), int(stop)))
# parse faste file and turn into dictionary
records = SeqIO.to_dict(SeqIO.parse(open('A.fasta'), 'fasta'))
# search for short sequences
short_seq_records = []
for name in positions:
for (start, stop) in positions[name]:
long_seq_record = records[name]
long_seq = long_seq_record.seq
alphabet = long_seq.alphabet
short_seq = str(long_seq)[start-1:stop]
short_seq_record =SeqRecord(Seq(name))+ '\t'+str(start)+'\t'+str(stop)+'\t' + SeqRecord(Seq(short_seq, alphabet))
short_seq_records.append(short_seq_record)
# write to file
with open('C.fasta', 'w') as f:
SeqIO.write(short_seq_records,f, 'fasta')
A.fasta
>chr16:13361561-13361573
TAGTGGGTCAGAC
>chr6_apd_hap1:2165669-2165681
AGATGAGTCATCA
>chr10:112612173-112612185
AAGTGTGTCAGCT
B.bed
chr6_apd_hap1 2165668 2165681
chr10 112612172 112612185
C.fasta
的预期输出>chr6_apd_hap1:2165669-2165681
AGATGAGTCATCA
>chr10:112612173-112612185
AAGTGTGTCAGCT
但我收到以下错误:
long_seq_record = records[name]
KeyError: 'chr6_apd_hap1'
在您的代码中,positions
是一个 defaultdict
,它具有 BED 文件中的名称作为键:
>>> print positions.keys()
['chr10', 'chr6_apd_hap1']
和records
是一个字典,以FASTA文件的headers为键,减去开头的>
,但它们仍然包括冒号和位置染色体:
>>> print records.keys()
['chr16:13361561-13361573', 'chr6_apd_hap1:2165669-2165681', 'chr10:112612173-112612185']
所以你首先需要转换你的 records
键来释放额外的信息,这样你就可以使用 positions
键来检索它们。您可以通过在创建 records
字典后添加以下行来执行此操作:
records = {key.split(':')[0]: value for (key, value) in records.iteritems()}
此外,您目前构建 short_seq_record
的方式实际上并不奏效。替换行
short_seq = str(long_seq)[start-1:stop]
short_seq_record =SeqRecord(Seq(name))+ '\t'+str(start)+'\t'+str(stop)+'\t' + SeqRecord(Seq(short_seq, alphabet))
与:
short_seq = Seq(str(long_seq)[start-1:stop], alphabet)
short_seq_id = '{0}\t{1}\t{2}'.format(name, start, stop)
short_seq_record = SeqRecord(short_seq, id=short_seq_id, description='')