How to extract coordinates from a P-Match result?
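I need to process the results produced by this link http://www.gene-regulation.com/cgi-bin/pub/programs/pmatch/bin/p-match.cgi so that I only get the sequence ID, start position and end position. How can I extract the coordinate information from the results? A sample result is below.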
Scanning sequence ID: BEST1_HUMAN
150 (-) 1.000 0.997 GGAAAggccc R05891
354 (+) 0.988 0.981 gtgtAGACAtt R06227
V$CREL_01c-RelV$EVI1_05Evi-1
Scanning sequence ID: 4F2_HUMAN
365 (+) 1.000 1.000 gggacCTACA R05884
789 (-) 1.000 1.000 gcgCGAAA R05828; R05834; R05835; R05838; R05839
V$CREL_01c-RelV$E2F_02E2F
Expected output:
Sequence ID start end
(The end site is the start site plus the number of characters in the short matched sequence, e.g. GGAAAggccc.)
BEST1_HUMAN 150 160
BEST1_HUMAN 354 365
4F2_HUMAN 365 375
4F2_HUMAN 789 797
Can anyone help me?
Use the snippet from this answer to split your results into evenly sized chunks and extract the data you need:
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

with open('p_match.txt') as f:
    # Assumes every record is exactly 6 lines long, with the two hits on its 3rd and 4th lines.
    for chunk in chunks(f.readlines(), 6):
        sequence_id = chunk[0].split()[-1]
        for i in (2, 3):
            fields = chunk[i].split()
            start = int(fields[0])
            # The matched sequence is the 5th column; [-2] breaks when a hit lists several accessions.
            sequence = fields[4]
            stop = start + len(sequence)
            print(sequence_id, start, stop)
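Edit: Apparently the results can contain a variable number of start positions, so the fixed-size chunking above won't work. In that case you can go the regex route, or walk through the file record by record: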
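with open('p_match.txt') as f:
    text = f.read()

# Each non-empty piece starts with a sequence ID, followed by that sequence's hit lines.
for chunk in text.split('Scanning sequence ID:'):
    if not chunk.strip():
        continue
    lines = chunk.split('\n')
    sequence_id = lines[0].strip()
    for line in lines[1:]:
        fields = line.split()
        # Hit lines begin with the numeric start position; anything else is skipped.
        if len(fields) >= 5 and fields[0].isdigit():
            start = int(fields[0])
            sequence = fields[4]  # matched sequence is the 5th column
            stop = start + len(sequence)
            print(sequence_id, start, stop)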
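For the regex route mentioned above, here is a minimal sketch. It assumes hit lines look exactly like the ones in your sample (start position, strand in parentheses, two scores, then the matched sequence). The names `HEADER_RE`, `HIT_RE` and `extract_coordinates` are my own, not part of P-Match, and the patterns may need adjusting if the real output has extra columns.

```python
import re

# Header line, e.g. "Scanning sequence ID: BEST1_HUMAN"
HEADER_RE = re.compile(r"^\s*Scanning sequence ID:\s*(\S+)")
# Hit line, e.g. "150 (-) 1.000 0.997 GGAAAggccc R05891"
# (layout assumed from the sample in the question)
HIT_RE = re.compile(r"^\s*(\d+)\s+\([+-]\)\s+[\d.]+\s+[\d.]+\s+([A-Za-z]+)")


def extract_coordinates(lines):
    """Yield (sequence_id, start, end) for every hit line in lines."""
    sequence_id = None
    for line in lines:
        header = HEADER_RE.match(line)
        if header:
            # Remember the ID until the next header shows up.
            sequence_id = header.group(1)
            continue
        hit = HIT_RE.match(line)
        if hit and sequence_id is not None:
            start = int(hit.group(1))
            # end = start + length of the matched sequence
            yield sequence_id, start, start + len(hit.group(2))


sample = """\
Scanning sequence ID: BEST1_HUMAN
150 (-) 1.000 0.997 GGAAAggccc R05891
354 (+) 0.988 0.981 gtgtAGACAtt R06227
V$CREL_01c-RelV$EVI1_05Evi-1
Scanning sequence ID: 4F2_HUMAN
365 (+) 1.000 1.000 gggacCTACA R05884
789 (-) 1.000 1.000 gcgCGAAA R05828; R05834; R05835; R05838; R05839
V$CREL_01c-RelV$E2F_02E2F
"""

for sequence_id, start, end in extract_coordinates(sample.splitlines()):
    print(sequence_id, start, end)
```

To run it on a file, pass the open file object instead of `sample.splitlines()`, since the function only needs an iterable of lines. Any line that matches neither pattern (such as the `V$...` lines) is ignored, so the number of hits per sequence does not matter.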