Calculating gene length by coordinates
I received a list of several thousand genes with coordinates from a colleague. It looks like this:
NPHP4 Nephronophthisis 4, 606966 (3), Autosomal recessive; Senior-Loken syndrome 4, 606996 (3), Autosomal recessive 1:6021825:6022054,1:6008105:6008352,1:6046180:6046368,1:5937125:5937385,1:6012735:6012908,1:5993185:5993432,1:5934495:5934756,1:5950905:5951117,1:5927765:5927985,1:5965330:5965582,1:5934905:5935193,1:6007135:6007317,1:5947315:5947565,1:6027325:6027445,1:5969190:5969291,1:5923920:5924129,1:5940145:5940333,1:5964645:5964898,1:5987685:5987868,1:5925130:5925361,1:6038305:6038513,1:5923300:5923503,1:5965665:5965876,1:5967145:5967318,1:5933280:5933439,1:5924375:5924620,1:5927065:5927202,1:5926410:5926553,1:6029125:6029336
ESPN Deafness, autosomal recessive 36, 609006 (3), Autosomal recessive; Deafness, neurosensory, without vestibular involvement, autosomal dominant (3) 1:6508675:6509172,1:6510480:6510561,1:6505700:6506011,1:6504515:6504761,1:6488250:6488500,1:6500280:6500530,1:6520035:6520244,1:6500660:6500893,1:6511640:6511845,1:6517250:6517357,1:6484990:6485353,1:6517385:6517459,1:6500970:6501143,1:6511860:6512157
PLEKHG5 Charcot-Marie-Tooth disease, recessive intermediate C, 615376 (3), Autosomal recessive; Spinal muscular atrophy, distal, autosomal recessive, 4, 611067 (3), Autosomal recessive 1:6529360:6529539,1:6531525:6531730,1:6532565:6532713,1:6530270:6530441,1:6556990:6557124,1:6529070:6529187,1:6527595:6527666,1:6533285:6533532,1:6537565:6537735,1:6529210:6529330,1:6534485:6534657,1:6579480:6579584,1:6533020:6533263,1:6545355:6545534,1:6531795:6531909,1:6527860:6528675,1:6530540:6530718,1:6534050:6534264,1:6535500:6535600,1:6531025:6531194,1:6530770:6530978,1:6556530:6556669,1:6535085:6535220,1:6536025:6536128,1:6529575:6529755,1:6557350:6557420
PARK7 Parkinson disease 7, autosomal recessive early-onset, 606324 (3), Autosomal recessive 1:8025355:8025499,1:8044895:8045137,1:8022820:8022967,1:8037705:8037817,1:8021880:8021956,1:8029380:8029481,1:8030930:8031036
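The third column contains the coordinates: the chromosome number, then the start position, then the end position, separated by ":". If a gene has more than one region, the regions are separated by ",":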
1:6021825:6022054,1:6008105:6008352,1:6046180:6046368,1:5937125:5937385,1:6012735:6012908,1:5993185:5993432,1:5934495:5934756,1:5950905:5951117,1:5927765:5927985,1:5965330:5965582,1:5934905:5935193,1:6007135:6007317,1:5947315:5947565,1:6027325:6027445,1:5969190:5969291,1:5923920:5924129,1:5940145:5940333,1:5964645:5964898,1:5987685:5987868,1:5925130:5925361,1:6038305:6038513,1:5923300:5923503,1:5965665:5965876,1:5967145:5967318,1:5933280:5933439,1:5924375:5924620,1:5927065:5927202,1:5926410:5926553,1:6029125:6029336
1:6508675:6509172,1:6510480:6510561,1:6505700:6506011,1:6504515:6504761,1:6488250:6488500,1:6500280:6500530,1:6520035:6520244,1:6500660:6500893,1:6511640:6511845,1:6517250:6517357,1:6484990:6485353,1:6517385:6517459,1:6500970:6501143,1:6511860:6512157
1:6529360:6529539,1:6531525:6531730,1:6532565:6532713,1:6530270:6530441,1:6556990:6557124,1:6529070:6529187,1:6527595:6527666,1:6533285:6533532,1:6537565:6537735,1:6529210:6529330,1:6534485:6534657,1:6579480:6579584,1:6533020:6533263,1:6545355:6545534,1:6531795:6531909,1:6527860:6528675,1:6530540:6530718,1:6534050:6534264,1:6535500:6535600,1:6531025:6531194,1:6530770:6530978,1:6556530:6556669,1:6535085:6535220,1:6536025:6536128,1:6529575:6529755,1:6557350:6557420
1:8025355:8025499,1:8044895:8045137,1:8022820:8022967,1:8037705:8037817,1:8021880:8021956,1:8029380:8029481,1:8030930:8031036
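I need to calculate the length of the regions, i.e. the difference between the end and start positions for each gene (each row), or the sum of those differences if a gene has more than one region. The number of regions differs from row to row. I tried to do this in Excel, but the number of fragments is too large, and in some cases they are not even displayed. Is there a way to calculate this for each row, perhaps with a regular expression?
I would like the output as a fourth column. For example, if the third column is: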
1:1167623:1168684
I expect:
1:1167623:1168684 1061
If the coordinate column is:
1:11907145:11907520,1:11906035:11906116,1:11907590:11907770
I expect:
1:11907145:11907520,1:11906035:11906116,1:11907590:11907770 636
Thank you very much.
This can be done fairly simply in Python. I have provided commented code below.
d = """\
NPHP4 Nephronophthisis 4, 606966 (3), Autosomal recessive; Senior-Loken syndrome 4, 606996 (3), Autosomal recessive 1:6021825:6022054,1:6008105:6008352,1:6046180:6046368,1:5937125:5937385,1:6012735:6012908,1:5993185:5993432,1:5934495:5934756,1:5950905:5951117,1:5927765:5927985,1:5965330:5965582,1:5934905:5935193,1:6007135:6007317,1:5947315:5947565,1:6027325:6027445,1:5969190:5969291,1:5923920:5924129,1:5940145:5940333,1:5964645:5964898,1:5987685:5987868,1:5925130:5925361,1:6038305:6038513,1:5923300:5923503,1:5965665:5965876,1:5967145:5967318,1:5933280:5933439,1:5924375:5924620,1:5927065:5927202,1:5926410:5926553,1:6029125:6029336
ESPN Deafness, autosomal recessive 36, 609006 (3), Autosomal recessive; Deafness, neurosensory, without vestibular involvement, autosomal dominant (3) 1:6508675:6509172,1:6510480:6510561,1:6505700:6506011,1:6504515:6504761,1:6488250:6488500,1:6500280:6500530,1:6520035:6520244,1:6500660:6500893,1:6511640:6511845,1:6517250:6517357,1:6484990:6485353,1:6517385:6517459,1:6500970:6501143,1:6511860:6512157
PLEKHG5 Charcot-Marie-Tooth disease, recessive intermediate C, 615376 (3), Autosomal recessive; Spinal muscular atrophy, distal, autosomal recessive, 4, 611067 (3), Autosomal recessive 1:6529360:6529539,1:6531525:6531730,1:6532565:6532713,1:6530270:6530441,1:6556990:6557124,1:6529070:6529187,1:6527595:6527666,1:6533285:6533532,1:6537565:6537735,1:6529210:6529330,1:6534485:6534657,1:6579480:6579584,1:6533020:6533263,1:6545355:6545534,1:6531795:6531909,1:6527860:6528675,1:6530540:6530718,1:6534050:6534264,1:6535500:6535600,1:6531025:6531194,1:6530770:6530978,1:6556530:6556669,1:6535085:6535220,1:6536025:6536128,1:6529575:6529755,1:6557350:6557420
PARK7 Parkinson disease 7, autosomal recessive early-onset, 606324 (3), Autosomal recessive 1:8025355:8025499,1:8044895:8045137,1:8022820:8022967,1:8037705:8037817,1:8021880:8021956,1:8029380:8029481,1:8030930:8031036
FOOBAR 1:11907145:11907520,1:11906035:11906116,1:11907590:11907770
"""
gene_rows = d.splitlines()
for gene_row in gene_rows:
    # Name like "NPHP4"
    gene_name = gene_row.split()[0]
    # List like ["1:6021825:6022054", "1:6008105:6008352", ...]
    regions = gene_row.split()[-1].split(",")
    # Counter to hold our total gene length.
    gene_length = 0
    for region in regions:
        # Split "1:6021825:6022054" into "1", "6021825", and "6022054"
        chromosome, start, end = region.split(":")
        # Update the gene length counter with this region's length.
        region_length = int(end) - int(start)
        gene_length += region_length
    print(gene_name, gene_length)
The output is
NPHP4 5984
ESPN 3296
PLEKHG5 4685
PARK7 928
FOOBAR 636
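Since the question asks about regular expressions and about getting the result as a fourth column, here is a minimal sketch of that variant. It assumes the coordinates are always the last whitespace-separated field of a row and that a tab is an acceptable separator for the new column. The names REGION, total_length and add_length_column are made up for this illustration.

```python
import re

# One "chromosome:start:end" region. The chromosome is allowed to be
# a number or a name such as X (an assumption about the input).
REGION = re.compile(r"(\w+):(\d+):(\d+)")


def total_length(coordinates):
    """Sum end - start over every region in a comma-separated string."""
    return sum(
        int(end) - int(start)
        for _chromosome, start, end in REGION.findall(coordinates)
    )


def add_length_column(line, sep="\t"):
    """Return the row with the summed region length appended as a new column."""
    line = line.rstrip("\n")
    # Assumption: the coordinate field is the last whitespace-separated token.
    coordinates = line.split()[-1]
    return f"{line}{sep}{total_length(coordinates)}"


print(add_length_column(
    "FOOBAR 1:11907145:11907520,1:11906035:11906116,1:11907590:11907770"
))
```

For a real file you would call add_length_column on each line and write the results to a new file. Like the code above, this computes end - start for each region, which matches the 1061 and 636 expected in the question.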