如何使用 Python 根据 BED 文件格式更改坐标格式？

Question

我有两个 fasta 文件，我想将 FileB.fasta 中的较短序列与 FileA.fasta 中的原始序列进行匹配，以获得其坐标或位置。但是我的输出格式不正确。谁能帮我？

FileA.fasta

>chr1:2000-2019
ACGTCGATCGGTCGACGTGC

FileB.fasta

>chr1:2000-2019
GATCGG

FileC.bed

chr1:2000-2019 6 11

代码

from Bio import SeqIO
output_file = open('fileC.bed','w')
for long_sequence_record in SeqIO.parse(open('fileA.fasta'), 'fasta'):
    long_sequence = str(long_sequence_record.seq)
    for short_sequence_record in SeqIO.parse(open('fileB.fasta'), 'fasta'):
        short_sequence = str(short_sequence_record.seq)
        if short_sequence in long_sequence:
            start = long_sequence.index(short_sequence) + 1
            stop = start + len(short_sequence) - 1
           # print short_sequence_record.id, start, stop
            output_line ='%s\t%i\t%i\n' % \
            (short_sequence_record.id,start,stop)
            output_file.write(output_line )
output_file.close()

FileC.bed

的期望输出

chr1 2005 2011

Answer 1

好吧，您正在为查找较短序列的索引添加 1 -

start = long_sequence.index(short_sequence) + 1 <--- notice the +1

不要那样做，应该没问题。也不要对 stop 变量执行 -1。

您应该改为添加 ID 中的起始序列号。

例子-

start = long_sequence.index(short_sequence) + int((short_sequence_record.id.split(':')[1].split('-')[0]))
stop = start + len(short_sequence)

对于记录的id，如果你不想要:之前的任何东西，那么你应该在:拆分id并取左边部分（0索引字符串拆分后）。

例子-

output_line ='%s\t%i\t%i\n' % \
        ((short_sequence_record.id.split(':')[0]),start,stop)
        output_file.write(output_line )

Answer 2

更通用的解决方案：

获取您的参考序列数据并使用 relevant UCSC Kent Tools 使其为 BLAT 做好准备。
执行 BLAT search 将短查询字符串与参考数据对齐并获取 PSL 文件。
Convert the PSL output 睡觉。

如何使用 Python 根据 BED 文件格式更改坐标格式？

How to change the coordinates format according to BED file format using Python?

python

bioinformatics

fasta