Python: Extract DNA sequences from a FASTA file using a BED file
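How can I extract DNA sequences from a FASTA file? I have tried bedtools and samtools. bedtools getfasta works well, but for some of my files it returns "warning: chromosome was not found in fasta file", even though the chromosome names in the BED file and the FASTA file are exactly the same. I am looking for an alternative way to do this in Python.
BED file: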
chr1:117223140-117223856 3 7
chr1:117223140-117223856 5 9
FASTA file:
>chr1:117223140-117223856
CGCGTGGGCTAGGGGCTAGCCCC
Expected output:
>chr1:117223140-117223856
CGTGG
>chr1:117223140-117223856
TGGGC
BioPython is what you want to use:
from collections import defaultdict

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

# read names and positions from the BED file
positions = defaultdict(list)
with open('positions.bed') as f:
    for line in f:
        name, start, stop = line.split()
        positions[name].append((int(start), int(stop)))

# parse the FASTA file and turn it into a dictionary keyed by record id
records = SeqIO.to_dict(SeqIO.parse('sequences.fasta', 'fasta'))

# extract the short sequences
short_seq_records = []
for name in positions:
    for (start, stop) in positions[name]:
        long_seq_record = records[name]
        # positions are treated as 1-based and inclusive; slicing a Seq
        # returns a Seq (Seq.alphabet was removed in Biopython 1.78, so it
        # is no longer passed along)
        short_seq = long_seq_record.seq[start-1:stop]
        short_seq_record = SeqRecord(short_seq, id=name, description='')
        short_seq_records.append(short_seq_record)

# write to file
with open('output.fasta', 'w') as f:
    SeqIO.write(short_seq_records, f, 'fasta')
Try this:
from Bio import SeqIO

# load every FASTA record into memory, keyed by record id
dict_fasta = {seq.id: seq for seq in SeqIO.parse("input.fasta", "fasta")}

with open("output.fasta", "w") as output, open("input.bed") as bed:
    for line in bed:
        seq_id, begin, end = line.split()
        if seq_id in dict_fasta:
            # [int(begin)-1:int(end)] if the first base in a chromosome is numbered 1
            # [int(begin):int(end)+1] if the first base in a chromosome is numbered 0
            output.write(dict_fasta[seq_id][int(begin)-1:int(end)].format("fasta"))
        else:
            print(seq_id + " not found")
If the first base in a chromosome is numbered 1, you get:
>chr1:117223140-117223856
CGTGG
>chr1:117223140-117223856
TGGGC
If the first base in a chromosome is numbered 0, you get:
>chr1:117223140-117223856
GTGGG
>chr1:117223140-117223856
GGGCT
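The two outputs differ only in how the BED coordinates are mapped to Python slice indices. Here is a minimal sketch of that arithmetic on plain strings, using the example sequence from the question. The function name `extract` and its `convention` values are made up for illustration. It also covers the standard BED convention (0-based start, end-exclusive), which gives yet another result, so it is worth checking which convention your file actually follows.

```python
def extract(seq, start, end, convention="one_based"):
    """Return the sub-sequence of seq for one BED-style interval.

    convention (hypothetical names, for illustration only):
      "one_based"  - first base is numbered 1, end is inclusive
      "zero_based" - first base is numbered 0, end is inclusive
      "bed"        - first base is numbered 0, end is exclusive
    """
    if convention == "one_based":
        return seq[start - 1:end]
    if convention == "zero_based":
        return seq[start:end + 1]
    if convention == "bed":
        return seq[start:end]
    raise ValueError("unknown convention: " + convention)


seq = "CGCGTGGGCTAGGGGCTAGCCCC"
intervals = [(3, 7), (5, 9)]

one_based = [extract(seq, s, e, "one_based") for s, e in intervals]
zero_based = [extract(seq, s, e, "zero_based") for s, e in intervals]
bed_style = [extract(seq, s, e, "bed") for s, e in intervals]

print(one_based)   # ['CGTGG', 'TGGGC']
print(zero_based)  # ['GTGGG', 'GGGCT']
print(bed_style)   # ['GTGG', 'GGGC']
```

Only the 1-based, inclusive reading reproduces the expected output given in the question.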
Your BED file needs to be tab-delimited for bedtools to use it. Replace the colon, dash and spaces with tabs.
The bedtools documentation states: "bedtools requires that all BED input files (and input received from stdin) are tab-delimited."
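If the tab requirement is the cause, one option is to rewrite the BED file with tabs before running bedtools. The sketch below is only an assumption about what your file needs: it converts the whitespace between fields to single tabs and leaves the first field (here `chr1:117223140-117223856`) untouched, since in the question's example that full name is the FASTA header. The function name `to_tab_delimited` is hypothetical.

```python
def to_tab_delimited(lines):
    """Yield each non-empty BED line with its fields joined by single tabs."""
    for line in lines:
        fields = line.split()  # splits on any run of spaces or tabs
        if fields:
            yield "\t".join(fields)


example = [
    "chr1:117223140-117223856 3 7\n",
    "chr1:117223140-117223856 5 9\n",
]
converted = list(to_tab_delimited(example))
for row in converted:
    print(repr(row))
```

For a real file, pass the open input file object to `to_tab_delimited` and write each yielded line, followed by a newline, to a new file.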