Renaming Name ID in gffile
I have a gff file that looks like this:
contig1 loci gene 452050 453069 15 - . ID=dd_g4_1G94;
contig1 loci mRNA 452050 453069 14 - . ID=dd_g4_1G94.1;Parent=dd_g4_1G94
contig1 loci exon 452050 452543 . - . ID=dd_g4_1G94.1.exon1;Parent=dd_g4_1G94.1
contig1 loci exon 452592 453069 . - . ID=dd_g4_1G94.1.exon2;Parent=dd_g4_1G94.1
contig1 loci mRNA 452153 453069 15 - . ID=dd_g4_1G94.2;Parent=dd_g4_1G94
contig1 loci exon 452153 452543 . - . ID=dd_g4_1G94.2.exon1;Parent=dd_g4_1G94.2
contig1 loci exon 452592 452691 . - . ID=dd_g4_1G94.2.exon2;Parent=dd_g4_1G94.2
contig1 loci exon 452729 453069 . - . ID=dd_g4_1G94.2.exon3;Parent=dd_g4_1G94.2
###
I want to rename the IDs, starting from 0001, so that for the gene above the entries become:
contig1 loci gene 452050 453069 15 - . ID=dd_0001;
contig1 loci mRNA 452050 453069 14 - . ID=dd_0001.1;Parent=dd_0001
contig1 loci exon 452050 452543 . - . ID=dd_0001.1.exon1;Parent=dd_0001.1
contig1 loci exon 452592 453069 . - . ID=dd_0001.1.exon2;Parent=dd_0001.1
contig1 loci mRNA 452153 453069 15 - . ID=dd_0001.2;Parent=dd_0001
contig1 loci exon 452153 452543 . - . ID=dd_0001.2.exon1;Parent=dd_0001.2
contig1 loci exon 452592 452691 . - . ID=dd_0001.2.exon2;Parent=dd_0001.2
contig1 loci exon 452729 453069 . - . ID=dd_0001.2.exon3;Parent=dd_0001.2
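The example above covers only one gene entry, but I want to rename all genes, together with their corresponding mRNAs/exons, consecutively, starting from ID=dd_0001.
Any hints on how to do this would be appreciated.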
You need to open the file and then replace the ID line by line.
Here are the documentation references for file I/O and str.replace().
gff_filename = 'filename.gff'
replace_string = 'dd_g4_1G94'
replace_with = 'dd_0001'

lines = []
# Read every line, replacing the old gene ID wherever it occurs.
with open(gff_filename, 'r') as gff_file:
    for line in gff_file:
        line = line.replace(replace_string, replace_with)
        lines.append(line)

# Overwrite the original file with the renamed lines.
with open(gff_filename, 'w') as gff_file:
    gff_file.writelines(lines)
Tested on Windows 10 with Python 3.5.1; it works.
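The reason a plain str.replace() is enough for a single gene is that every child ID and Parent value contains the gene ID, so one substitution per line renames all of them at once. Here is a minimal in-memory sketch of that idea; the helper name `rename_gene` is made up for illustration, and the sample lines are taken from the question:

```python
def rename_gene(lines, old_id, new_id):
    """Return a copy of `lines` with every occurrence of old_id replaced by new_id."""
    return [line.replace(old_id, new_id) for line in lines]


sample = [
    "contig1 loci gene 452050 453069 15 - . ID=dd_g4_1G94;\n",
    "contig1 loci mRNA 452050 453069 14 - . ID=dd_g4_1G94.1;Parent=dd_g4_1G94\n",
    "contig1 loci exon 452050 452543 . - . ID=dd_g4_1G94.1.exon1;Parent=dd_g4_1G94.1\n",
]

renamed = rename_gene(sample, "dd_g4_1G94", "dd_0001")
print("".join(renamed), end="")
```

Keep in mind that this is a substring replace: if one gene ID happens to be contained in another (say `dd_g4_1G9` and `dd_g4_1G94`), the shorter one would also rewrite part of the longer one.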
To find the IDs, you should use a regex.
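import re

gff_filename = 'filename.gff'
replace_with = 'dd_{}'
# Capture the gene-level part of each ID: everything after "ID=" up to the first ';' or '.'
re_pattern = r'ID=(.*?)[;\.]'

ids = []    # unique gene IDs, in order of first appearance
lines = []  # rewritten lines

with open(gff_filename, 'r') as gff_file:
    file_lines = [line for line in gff_file]

# First pass: collect the unique gene IDs.
for line in file_lines:
    matches = re.findall(re_pattern, line)
    for found_id in matches:
        if found_id not in ids:
            ids.append(found_id)

# Second pass: replace each old ID with its zero-padded number, starting at 0001.
# This is a plain substring replace, so it assumes no ID is a substring of another.
for line in file_lines:
    for ID in ids:
        if ID in line:
            id_suffix = str(ids.index(ID) + 1).zfill(4)
            line = line.replace(ID, replace_with.format(id_suffix))
    lines.append(line)

# Overwrite the original file with the renamed lines.
with open(gff_filename, 'w') as gff_file:
    gff_file.writelines(lines)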
There are other ways to do this, but this one is quite reliable.
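As one of those other ways, here is a sketch that only touches the values of the ID= and Parent= attributes instead of replacing substrings anywhere in the line. It rests on assumptions drawn from the sample in the question: gene features have `gene` in the third column, and every child ID is the gene ID followed by a `.`-separated suffix. The function name `renumber_ids`, its `prefix` and `width` parameters, and the second gene in the sample data (`dd_g4_1G95`) are made up for illustration.

```python
import re

# Value of an ID= or Parent= attribute, up to the next ';' or whitespace.
ATTR_RE = re.compile(r'\b(ID|Parent)=([^;\s]+)')


def renumber_ids(lines, prefix='dd_', width=4):
    """Renumber gene IDs consecutively from 1 and carry child suffixes along."""
    mapping = {}  # old gene ID -> new gene ID

    # First pass: number the gene features in order of appearance.
    for line in lines:
        if line.startswith('#'):
            continue
        fields = line.split()
        if len(fields) >= 9 and fields[2] == 'gene':
            found = re.search(r'\bID=([^;\s]+)', line)
            if found and found.group(1) not in mapping:
                number = str(len(mapping) + 1).zfill(width)
                mapping[found.group(1)] = prefix + number

    def swap_one(value):
        # Strip trailing '.suffix' parts until a known gene ID is left.
        head = value
        while head:
            if head in mapping:
                return mapping[head] + value[len(head):]
            head = head.rpartition('.')[0]
        return value  # unknown ID: leave it unchanged

    def swap(match):
        # Parent= may hold several comma-separated IDs.
        values = match.group(2).split(',')
        return match.group(1) + '=' + ','.join(swap_one(v) for v in values)

    # Second pass: rewrite only the ID= and Parent= values.
    return [ATTR_RE.sub(swap, line) for line in lines]


sample = [
    "contig1 loci gene 452050 453069 15 - . ID=dd_g4_1G94;\n",
    "contig1 loci mRNA 452153 453069 15 - . ID=dd_g4_1G94.2;Parent=dd_g4_1G94\n",
    "contig1 loci exon 452153 452543 . - . ID=dd_g4_1G94.2.exon1;Parent=dd_g4_1G94.2\n",
    "###\n",
    # Hypothetical second gene, added only to show consecutive numbering.
    "contig1 loci gene 460000 461000 15 + . ID=dd_g4_1G95;\n",
    "contig1 loci mRNA 460000 461000 15 + . ID=dd_g4_1G95.1;Parent=dd_g4_1G95\n",
]

renamed = renumber_ids(sample)
print("".join(renamed), end="")
```

Because the lookup is on whole attribute values rather than substrings, one gene ID being contained in another does not cause accidental rewrites. To apply it to a file, read the lines with `readlines()`, pass them through `renumber_ids`, and write the result to a new file so the original is kept as a backup.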