How to extract and merge the coordinates using Python?
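Below is the position information for 3 sequences: sequence name, start site, and stop site. I want to work out the exact position of each site. For example, the value 785 is a start site counted from position 27860291, and the end value 789 is really 27861079. Can anyone help me?
Code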
from Bio import SeqIO
from collections import defaultdict
output_file = open('B.bed','w')
with open('A.bed') as f:
    for line in f:
        name, start, stop = line.split()
        start = int((name.split(':')[1].split('-')[0]))+line.index(start)
        stop = start + len(stop)
    # print short_sequence_record.id, start, stop
    output_line ='%s\t%i\t%i\n' % \
    ((line.split(':')[0]),start,stop)
    output_file.write(output_line )
output_file.close()
A.bed
chr1:27860291-27862300 785 789
chr1:27860291-27862300 1539 1543
chr1:15504072-15506081 675 679
Output
chr1 15504096 15504099
Expected output
chr1 27861075 27861079
chr1 27861829 27861833
chr1 15504746 15504750
I think you can split the string to create a list, like this:
>>> import re
>>> ch = '27860291-27862300 785 789'
>>> ll = [int(i) for i in re.split(r'[ -]+', ch)]
>>> print ll
[27860291, 27862300, 785, 789]
>>> start = ll[0]+ll[2]-1
>>> end = ll[0]+ll[3]-1
>>> print start,end
27861075 27861079
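The same arithmetic can be wrapped in a small helper that also keeps the chromosome name. This is only a sketch, written in Python 3 syntax (unlike the Python 2 snippets in this thread); the function name `to_genomic` is hypothetical, and it assumes every record looks like `chrom:start-end offset_start offset_end` with 1-based offsets, which is what the expected output in the question implies.

```python
import re


def to_genomic(record):
    """Convert 'chrom:start-end off_start off_end' to (chrom, start, end).

    Assumption: offsets are 1-based relative to the region start.
    """
    chrom, _, rest = record.strip().partition(':')
    # Split on runs of whitespace or '-': region start, region end, two offsets
    region_start, _region_end, off_start, off_end = (
        int(field) for field in re.split(r'[\s-]+', rest)
    )
    return chrom, region_start + off_start - 1, region_start + off_end - 1


records = [
    'chr1:27860291-27862300 785 789',
    'chr1:27860291-27862300 1539 1543',
    'chr1:15504072-15506081 675 679',
]
results = [to_genomic(r) for r in records]
for chrom, start, end in results:
    print('{}\t{}\t{}'.format(chrom, start, end))
```

For the three records from A.bed this prints the three lines listed under the expected output.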
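Because whitespace is significant in Python, this is an indentation error. The lines that build and write the output are at the wrong indentation level, so Python executes them outside the loop.
Here is the updated program: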
from collections import defaultdict
output_file = open('B.bed','w')
with open('A.bed') as f:
    for line in f:
        name, start, stop = line.split()
        start = int((name.split(':')[1].split('-')[0]))+line.index(start)
        stop = start + len(stop)
        # print short_sequence_record.id, start, stop
        output_line ='%s\t%i\t%i\n' % \
        ((line.split(':')[0]),start,stop)
        output_file.write(output_line )
output_file.close()
Judging from your expected output, it looks like you simply add each of the last 2 numbers on a line to the first number and subtract one.
import re  # regular expressions, not needed (alternatives: the `split` method) but convenient
inputfile = 'A.bed'  # the input file from the question
re_pattern = r'[^:]*:(\d+)\D+(\d+)\D+(\d+)\D+(\d+)'
with open(inputfile) as fin:
    for line in fin:
        # groups: region start, region end (unused), start offset, end offset
        start, _, offset_start, offset_end = re.search(re_pattern, line).groups()
        print('chr1 {} {}'.format(int(start) + int(offset_start) - 1, int(start) + int(offset_end) - 1))
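A variant of this approach that captures the chromosome name instead of hard-coding `chr1`, and writes to a file rather than printing, might look like the sketch below. The function name `convert_bed` and the named groups are illustrative assumptions, not part of the answer above; malformed or blank lines are simply skipped.

```python
import re

# chrom:region_start-region_end, then two 1-based offsets
RECORD = re.compile(
    r'(?P<chrom>[^:\s]+):(?P<start>\d+)-\d+\s+(?P<off1>\d+)\s+(?P<off2>\d+)'
)


def convert_bed(in_path, out_path):
    """Write one 'chrom<TAB>start<TAB>end' line per parsable input record.

    Returns the number of records written.
    """
    written = 0
    with open(in_path) as fin, open(out_path, 'w') as fout:
        for line in fin:
            match = RECORD.search(line)
            if match is None:  # skip blank or malformed lines
                continue
            base = int(match.group('start'))
            start = base + int(match.group('off1')) - 1
            end = base + int(match.group('off2')) - 1
            fout.write('{}\t{}\t{}\n'.format(match.group('chrom'), start, end))
            written += 1
    return written
```

Calling `convert_bed('A.bed', 'B.bed')` on the input from the question should write the three tab-separated lines given as the expected output. Opening both files in one `with` statement also removes the need for an explicit `close()`.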
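There are several reasons your code does not produce the desired output. One of them is that stop = start + len(stop) takes the length of a string; you need to convert it to int explicitly first. You also need to watch the indentation: right now you write only a single string, after the for loop has finished, whereas it looks like you want to do that on every pass through the loop.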
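To make the first point concrete, the short sketch below contrasts what the original expressions compute with what the arithmetic needs. It is written in Python 3 syntax, uses the first record from A.bed, assumes single-space separators, and its variable names are illustrative only.

```python
line = 'chr1:27860291-27862300 785 789'
name, start, stop = line.split()

# What the original code computes:
char_position = line.index(start)  # where the text '785' sits in the line, not the number 785
text_length = len(stop)            # number of characters in '789', i.e. 3

# What the arithmetic needs:
offset_start = int(start)          # 785
offset_end = int(stop)             # 789

region_start = int(name.split(':')[1].split('-')[0])  # 27860291
print(region_start + offset_start - 1, region_start + offset_end - 1)
```

The final line prints 27861075 27861079, which matches the first line of the expected output.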
I think the problem is that the line output_line ='%s\t%i\t%i\n' % ((line.split(':')[0]),start,stop)
and the lines that follow it are not indented far enough.
from Bio import SeqIO
output_file = open('B.bed','w')
with open('A.bed') as f:
    for line in f:
        name, start, stop = line.split()
        start = int((name.split(':')[1].split('-')[0])) + line.index(start)
        stop = start + len(stop)
        # print short_sequence_record.id, start, stop
        output_line ='%s\t%i\t%i\n' % ((line.split(':')[0]), start, stop)
        print output_line
        output_file.write(output_line)
output_file.close()
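With the indentation corrected, this writes one line per input record (the coordinates still differ from the expected output, because the arithmetic is unchanged), printing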
chr1 27860315 27860318
chr1 27860315 27860319
chr1 15504096 15504099
This data has also landed in the file, as the following shows:
with open('B.bed') as f:
    for line in f:
        print line,
which produces
chr1 27860315 27860318
chr1 27860315 27860319
chr1 15504096 15504099