How to extract and merge the coordinates using Python?

Question

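Below is the position information for 3 sequences: sequence name, start site and stop site. I want to report the exact position of each site. For example, the value 785 is the start site counted from position 27860291 (so it is really 27861075), and the site ends at 789, which is really 27861079. Can anyone help me?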

Code

from Bio import SeqIO
from collections import defaultdict
output_file = open('B.bed','w')
with open('A.bed') as f:
    for line in f:
        name, start, stop = line.split()
        start =  int((name.split(':')[1].split('-')[0]))+line.index(start)
        stop = start + len(stop)            
        # print short_sequence_record.id, start, stop
    output_line ='%s\t%i\t%i\n' % \
        ((line.split(':')[0]),start,stop)
    output_file.write(output_line )
output_file.close()

A.bed

chr1:27860291-27862300  785 789
chr1:27860291-27862300  1539 1543
chr1:15504072-15506081  675 679

Output

chr1 15504096 15504099

Expected output

chr1 27861075 27861079
chr1 27861829 27861833
chr1 15504746 15504750

Answer 1

I think you can split the string to build a list, like this:

import re

ch = '27860291-27862300  785 789'
# Split on runs of spaces or hyphens and convert each piece to int.
ll = [int(i) for i in re.split(r'[ -]+', ch)]
print(ll)          # [27860291, 27862300, 785, 789]
start = ll[0] + ll[2] - 1
end = ll[0] + ll[3] - 1
print(start, end)  # 27861075 27861079
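The same arithmetic can be wrapped in a small helper that handles a whole line of A.bed, including the chromosome name. This is only a sketch: the function name `to_absolute` is made up for illustration, and it assumes every line has the form `chrom:start-end  offset_start offset_end` with 1-based offsets, as the expected output suggests.

```python
import re

def to_absolute(line):
    """Convert one 'chrom:start-end  off_start off_end' line to absolute coordinates."""
    # Split on colons, hyphens, or runs of whitespace.
    chrom, region_start, _region_end, off_start, off_end = re.split(r'[:\-\s]+', line.strip())
    base = int(region_start)
    # Offsets are assumed 1-based, so offset 1 maps to the region start itself.
    return chrom, base + int(off_start) - 1, base + int(off_end) - 1

print(to_absolute('chr1:27860291-27862300  785 789'))
# ('chr1', 27861075, 27861079)
```

Returning a tuple rather than a formatted string leaves the choice of separator (tab or space) to the caller.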

Answer 2

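Because Python uses significant whitespace, this is an indentation error. The output lines are at the wrong indentation level, so Python treats them as being outside the loop.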

Here is the updated program:

from collections import defaultdict
output_file = open('B.bed','w')
with open('A.bed') as f:
    for line in f:
        name, start, stop = line.split()
        start =  int((name.split(':')[1].split('-')[0]))+line.index(start)
        stop = start + len(stop)            
        # print short_sequence_record.id, start, stop
        output_line ='%s\t%i\t%i\n' % \
        ((line.split(':')[0]),start,stop)
        output_file.write(output_line )
output_file.close()
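With the indentation fixed, the script writes one line per record, but the coordinates are still derived from `line.index(start)` and `len(stop)` rather than from the numeric offsets. A minimal sketch that also converts the offsets to integers might look like the following. The helper name `convert` is hypothetical, and the 1-based offset convention is inferred from the expected output in the question.

```python
import io

def convert(src, dst):
    """Read 'chrom:start-end off_start off_end' lines from src; write absolute coordinates to dst."""
    for line in src:
        if not line.strip():
            continue  # skip blank lines
        name, off_start, off_end = line.split()
        chrom, region = name.split(':')
        base = int(region.split('-')[0])
        # Treat the offsets as numbers, not as substrings of the line.
        start = base + int(off_start) - 1
        stop = base + int(off_end) - 1
        dst.write('%s\t%i\t%i\n' % (chrom, start, stop))

# Demo on in-memory data; with real files, pass open('A.bed') and open('B.bed', 'w').
demo_in = io.StringIO('chr1:27860291-27862300  785 789\n'
                      'chr1:15504072-15506081  675 679\n')
demo_out = io.StringIO()
convert(demo_in, demo_out)
print(demo_out.getvalue(), end='')
```

Taking file-like objects instead of file names keeps the function easy to try out without touching the disk.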

Answer 3

Judging from your expected output, it looks like you simply add each of the last two numbers on a line to the first number and subtract one.

import re  # regular expressions; not strictly needed (the `split` method also works) but convenient

inputfile = 'A.bed'  # input file from the question
re_pattern = r'([^:]*):(\d+)\D+(\d+)\D+(\d+)\D+(\d+)'
with open(inputfile) as fin:
    for line in fin:
        # Groups: chromosome, region start, region end (unused), start offset, end offset.
        chrom, start, _, offset_start, offset_end = re.search(re_pattern, line).groups()
        print('{} {} {}'.format(chrom, int(start) + int(offset_start) - 1, int(start) + int(offset_end) - 1))

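There are several reasons why your code does not produce the desired output. One is that stop = start + len(stop) takes the length of a string; you need to convert it explicitly to int first. You also need to watch your indentation: right now you write a single string only after the for loop has finished, whereas you apparently want to write one on every pass through the loop.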

Answer 4

I think the problem is that the line output_line ='%s\t%i\t%i\n' % ((line.split(':')[0]),start,stop) and the line after it are not indented enough.

from Bio import SeqIO

output_file = open('B.bed','w')
with open('A.bed') as f:
    for line in f:
        name, start, stop = line.split()
        start =  int((name.split(':')[1].split('-')[0])) + line.index(start)
        stop = start + len(stop)            

        # print short_sequence_record.id, start, stop
        output_line ='%s\t%i\t%i\n' % ((line.split(':')[0]), start, stop)
        print output_line
        output_file.write(output_line)

output_file.close()

With that change the script appears to produce a line for every record, printing

chr1    27860315    27860318

chr1    27860315    27860319

chr1    15504096    15504099

The data also ends up in the file, as this check shows:

with open('B.bed') as f:
    for line in f:
        print line,

which produces

chr1    27860315    27860318
chr1    27860315    27860319
chr1    15504096    15504099
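These values differ from the expected output in the question, and a quick check shows where they come from. `line.index(start)` returns the character position of the substring within the line, and `len(stop)` returns its number of characters, so neither uses the numeric offsets. The snippet below reproduces the first printed row and contrasts it with the integer arithmetic; the variable names are chosen only for illustration.

```python
line = 'chr1:27860291-27862300  785 789'
name, start, stop = line.split()
base = int(name.split(':')[1].split('-')[0])

# What the script computes: a string position and a string length.
pos = line.index(start)          # character index of '785' within the line
buggy_start = base + pos
buggy_stop = buggy_start + len(stop)
print(buggy_start, buggy_stop)   # 27860315 27860318

# What the expected output implies: 1-based numeric offsets.
print(base + int(start) - 1, base + int(stop) - 1)  # 27861075 27861079
```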


python

extract

bioinformatics

extraction