Calculating gene length by coordinates

I received a list of several thousand genes with coordinates from a colleague. It looks like this:

NPHP4   Nephronophthisis 4, 606966 (3), Autosomal recessive; Senior-Loken syndrome 4, 606996 (3), Autosomal recessive   1:6021825:6022054,1:6008105:6008352,1:6046180:6046368,1:5937125:5937385,1:6012735:6012908,1:5993185:5993432,1:5934495:5934756,1:5950905:5951117,1:5927765:5927985,1:5965330:5965582,1:5934905:5935193,1:6007135:6007317,1:5947315:5947565,1:6027325:6027445,1:5969190:5969291,1:5923920:5924129,1:5940145:5940333,1:5964645:5964898,1:5987685:5987868,1:5925130:5925361,1:6038305:6038513,1:5923300:5923503,1:5965665:5965876,1:5967145:5967318,1:5933280:5933439,1:5924375:5924620,1:5927065:5927202,1:5926410:5926553,1:6029125:6029336
ESPN    Deafness, autosomal recessive 36, 609006 (3), Autosomal recessive; Deafness, neurosensory, without vestibular involvement, autosomal dominant (3)       1:6508675:6509172,1:6510480:6510561,1:6505700:6506011,1:6504515:6504761,1:6488250:6488500,1:6500280:6500530,1:6520035:6520244,1:6500660:6500893,1:6511640:6511845,1:6517250:6517357,1:6484990:6485353,1:6517385:6517459,1:6500970:6501143,1:6511860:6512157
PLEKHG5 Charcot-Marie-Tooth disease, recessive intermediate C, 615376 (3), Autosomal recessive; Spinal muscular atrophy, distal, autosomal recessive, 4, 611067 (3), Autosomal recessive        1:6529360:6529539,1:6531525:6531730,1:6532565:6532713,1:6530270:6530441,1:6556990:6557124,1:6529070:6529187,1:6527595:6527666,1:6533285:6533532,1:6537565:6537735,1:6529210:6529330,1:6534485:6534657,1:6579480:6579584,1:6533020:6533263,1:6545355:6545534,1:6531795:6531909,1:6527860:6528675,1:6530540:6530718,1:6534050:6534264,1:6535500:6535600,1:6531025:6531194,1:6530770:6530978,1:6556530:6556669,1:6535085:6535220,1:6536025:6536128,1:6529575:6529755,1:6557350:6557420
PARK7   Parkinson disease 7, autosomal recessive early-onset, 606324 (3), Autosomal recessive   1:8025355:8025499,1:8044895:8045137,1:8022820:8022967,1:8037705:8037817,1:8021880:8021956,1:8029380:8029481,1:8030930:8031036

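The third column holds the coordinates: the chromosome number, then the start position and the end position, separated by ":". If a gene has several regions, they are separated by ",":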

1:6021825:6022054,1:6008105:6008352,1:6046180:6046368,1:5937125:5937385,1:6012735:6012908,1:5993185:5993432,1:5934495:5934756,1:5950905:5951117,1:5927765:5927985,1:5965330:5965582,1:5934905:5935193,1:6007135:6007317,1:5947315:5947565,1:6027325:6027445,1:5969190:5969291,1:5923920:5924129,1:5940145:5940333,1:5964645:5964898,1:5987685:5987868,1:5925130:5925361,1:6038305:6038513,1:5923300:5923503,1:5965665:5965876,1:5967145:5967318,1:5933280:5933439,1:5924375:5924620,1:5927065:5927202,1:5926410:5926553,1:6029125:6029336
1:6508675:6509172,1:6510480:6510561,1:6505700:6506011,1:6504515:6504761,1:6488250:6488500,1:6500280:6500530,1:6520035:6520244,1:6500660:6500893,1:6511640:6511845,1:6517250:6517357,1:6484990:6485353,1:6517385:6517459,1:6500970:6501143,1:6511860:6512157
1:6529360:6529539,1:6531525:6531730,1:6532565:6532713,1:6530270:6530441,1:6556990:6557124,1:6529070:6529187,1:6527595:6527666,1:6533285:6533532,1:6537565:6537735,1:6529210:6529330,1:6534485:6534657,1:6579480:6579584,1:6533020:6533263,1:6545355:6545534,1:6531795:6531909,1:6527860:6528675,1:6530540:6530718,1:6534050:6534264,1:6535500:6535600,1:6531025:6531194,1:6530770:6530978,1:6556530:6556669,1:6535085:6535220,1:6536025:6536128,1:6529575:6529755,1:6557350:6557420
1:8025355:8025499,1:8044895:8045137,1:8022820:8022967,1:8037705:8037817,1:8021880:8021956,1:8029380:8029481,1:8030930:8031036
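
To illustrate the format described above, one coordinate string can be split into (chromosome, start, end) tuples. This is only a minimal sketch: the helper name `parse_regions` is an assumption made for illustration, and it assumes every region has exactly three ":"-separated fields.

```python
def parse_regions(coords):
    """Parse "chrom:start:end,chrom:start:end,..." into a list of tuples."""
    regions = []
    for region in coords.split(","):
        # Each region looks like "1:6021825:6022054".
        chromosome, start, end = region.split(":")
        # Keep the chromosome as a string so names such as "X" also work.
        regions.append((chromosome, int(start), int(end)))
    return regions


print(parse_regions("1:11907145:11907520,1:11906035:11906116"))
```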

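I need to calculate the length of the regions, i.e. the difference between the end and start positions for each gene (each row), or the sum of those differences if a gene has several regions. The number of regions differs from row to row. I tried to do this in Excel, but the number of fragments is too large, and in some cases they are not even displayed. Is there a way to compute this for each row, perhaps with regular expressions?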

I would like the output as a fourth column. For example, if the third column is:

1:1167623:1168684

I expect:

1:1167623:1168684 1061

If the coordinate column is:

1:11907145:11907520,1:11906035:11906116,1:11907590:11907770

I expect:

1:11907145:11907520,1:11906035:11906116,1:11907590:11907770 636

Thank you very much.

This can be done fairly simply with Python. I have provided commented code below.

d = """\
NPHP4   Nephronophthisis 4, 606966 (3), Autosomal recessive; Senior-Loken syndrome 4, 606996 (3), Autosomal recessive   1:6021825:6022054,1:6008105:6008352,1:6046180:6046368,1:5937125:5937385,1:6012735:6012908,1:5993185:5993432,1:5934495:5934756,1:5950905:5951117,1:5927765:5927985,1:5965330:5965582,1:5934905:5935193,1:6007135:6007317,1:5947315:5947565,1:6027325:6027445,1:5969190:5969291,1:5923920:5924129,1:5940145:5940333,1:5964645:5964898,1:5987685:5987868,1:5925130:5925361,1:6038305:6038513,1:5923300:5923503,1:5965665:5965876,1:5967145:5967318,1:5933280:5933439,1:5924375:5924620,1:5927065:5927202,1:5926410:5926553,1:6029125:6029336
ESPN    Deafness, autosomal recessive 36, 609006 (3), Autosomal recessive; Deafness, neurosensory, without vestibular involvement, autosomal dominant (3)       1:6508675:6509172,1:6510480:6510561,1:6505700:6506011,1:6504515:6504761,1:6488250:6488500,1:6500280:6500530,1:6520035:6520244,1:6500660:6500893,1:6511640:6511845,1:6517250:6517357,1:6484990:6485353,1:6517385:6517459,1:6500970:6501143,1:6511860:6512157
PLEKHG5 Charcot-Marie-Tooth disease, recessive intermediate C, 615376 (3), Autosomal recessive; Spinal muscular atrophy, distal, autosomal recessive, 4, 611067 (3), Autosomal recessive        1:6529360:6529539,1:6531525:6531730,1:6532565:6532713,1:6530270:6530441,1:6556990:6557124,1:6529070:6529187,1:6527595:6527666,1:6533285:6533532,1:6537565:6537735,1:6529210:6529330,1:6534485:6534657,1:6579480:6579584,1:6533020:6533263,1:6545355:6545534,1:6531795:6531909,1:6527860:6528675,1:6530540:6530718,1:6534050:6534264,1:6535500:6535600,1:6531025:6531194,1:6530770:6530978,1:6556530:6556669,1:6535085:6535220,1:6536025:6536128,1:6529575:6529755,1:6557350:6557420
PARK7   Parkinson disease 7, autosomal recessive early-onset, 606324 (3), Autosomal recessive   1:8025355:8025499,1:8044895:8045137,1:8022820:8022967,1:8037705:8037817,1:8021880:8021956,1:8029380:8029481,1:8030930:8031036
FOOBAR 1:11907145:11907520,1:11906035:11906116,1:11907590:11907770
"""

gene_rows = d.splitlines()

for gene_row in gene_rows:
    # Name like "NPHP4"
    gene_name = gene_row.split()[0]
    # List like ["1:6021825:6022054", "1:6008105:6008352", ...]
    regions = gene_row.split()[-1].split(",")
    # Counter to hold our total gene length.
    gene_length = 0
    for region in regions:
        # Split "1:6021825:6022054" into "1", "6021825", and "6022054"
        chromosome, start, end = region.split(":")
        # Update the gene length counter with this region's length.
        region_length = int(end) - int(start)
        gene_length += region_length
    print(gene_name, gene_length)

The output is:

NPHP4 5984
ESPN 3296
PLEKHG5 4685
PARK7 928
FOOBAR 636
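
Since the question asks for the length as a fourth column and mentions regular expressions, the same calculation could also be sketched as follows. This is an illustrative variant, not the only way to do it. The names `REGION`, `total_length` and `with_length_column` are hypothetical, as is the `GENE_A` row. It assumes the coordinates are the last whitespace-free field on each non-empty line, and that a tab is an acceptable separator for the new column.

```python
import re

# Matches one region such as "1:6021825:6022054" (chromosome:start:end).
REGION = re.compile(r"([^:,\s]+):(\d+):(\d+)")


def total_length(coords):
    """Sum end - start over every region in a comma-separated coordinate string."""
    return sum(int(end) - int(start) for _chrom, start, end in REGION.findall(coords))


def with_length_column(line):
    """Return the line with the summed region length appended as a new column."""
    line = line.rstrip("\n")
    # Assumption: the coordinates are the last whitespace-delimited field.
    coords = line.split()[-1]
    return f"{line}\t{total_length(coords)}"


rows = [
    "GENE_A\t1:1167623:1168684",  # hypothetical gene name
    "FOOBAR\t1:11907145:11907520,1:11906035:11906116,1:11907590:11907770",
]
for row in rows:
    print(with_length_column(row))
```

For a real file, the same function could be applied to each line read from the input and the results written to a new file. Note that `end - start` follows the convention used in the question; if the coordinates are 1-based and inclusive, each region would be one base longer.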