Sorting on a column alphanumerically

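I have the following file, which I want to sort alphanumerically on the 6th column so that, for a particular ID before the ":", E1 is followed by I1, then E2, and so on. With sort -V -k6 file, all the ID:Is end up at the end rather than where they should be. However, when I do sort -k6 it does put the Es and Is of an ID together, but some IDs belonging to a different series are interspersed (I have highlighted them here). How can I get a sort in which two IDs are never mixed and the column follows that order? Here is the file: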

chr1    259017  259121  104 -   ENSG00000228463:E2
chr1    259122  267095  7973    -   ENSG00000228463:I1
chr1    267096  267253  157 -   ENSG00000228463:E1
chr1    317720  317781  61  +   ENSG00000237094:E1
chr1    317782  320161  2379    +   ENSG00000237094:I1
chr1    320162  320653  491 +   ENSG00000237094:E2
chr1    320654  320880  226 +   ENSG00000237094:I2
chr1    320881  320938  57  +   ENSG00000237094:E3
chr1    320939  321031  92  +   ENSG00000237094:I3
chr1    321032  321290  258 +   ENSG00000237094:E4
chr1    321291  322037  746 +   ENSG00000237094:I4
chr1    322038  322228  190 +   ENSG00000237094:E5
chr1    322229  322671  442 +   ENSG00000237094:I5
chr1    322672  323073  401 +   ENSG00000237094:E6
chr1    323074  323860  786 +   ENSG00000237094:I6
chr1    323861  324060  199 +   ENSG00000237094:E7
chr1    324061  324287  226 +   ENSG00000237094:I7
chr1    324288  324345  57  +   ENSG00000237094:E8
chr1    324346  324438  92  +   ENSG00000237094:I8
chr1    324439  326514  2075    +   ENSG00000237094:E9
**chr1  326096  326569  473 +   ENSG00000250575:E1**
chr1    326515  327551  1036    +   ENSG00000237094:I9
**chr1  326570  327347  777 +   ENSG00000250575:I1**
**chr1  327348  328112  764 +   ENSG00000250575:E2**
chr1    327552  328453  901 +   ENSG00000237094:E10
chr1    328454  329783  1329    +   ENSG00000237094:I10
**chr1  329431  329620  189 -   ENSG00000233653:E2**
**chr1  329621  329949  328 -   ENSG00000233653:I1**
chr1    329784  329976  192 +   ENSG00000237094:E11

Original answer:

sed 's/:[EI]/&_ /' foo.txt |  #separate the number at the end with a space
sort -k6 | sort -n -k7 |         #sort by code, then by [EI] number
sed 's/_ //'                  #remove the underscore space

I like to do things like this by 'protecting' strings with placeholders to isolate the part I'm interested in, and then replacing them afterwards.

A closer attempt:

sed 's/:[EI]/_ &_ /' foo.txt | sort -n -k8 | sort -k6,6 | sed 's/_ //g'

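But this naively assumes that sort works in a very specific way, which it doesn't: piping one sort into another only helps if the second sort keeps tied lines in their incoming order, and by default sort falls back to comparing whole lines when the keys tie. So sometimes E2 shows up before E1...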

I'm not sure this can be done with sort alone; awk may be the way to go...
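
For what it's worth, the two-pass idea behind the pipeline does work when the second pass is stable. Python's built-in sort is guaranteed to be stable, so here is a minimal sketch of it, using a few rows from the question with whitespace collapsed. The helper name split_key is made up for illustration.

```python
def split_key(line):
    """Return (gene_id, letter, number) parsed from the last column."""
    gene_id, ei = line.split()[-1].split(":")
    return gene_id, ei[0], int(ei[1:])

# A few rows from the question, deliberately out of order.
rows = [
    "chr1 326515 327551 1036 + ENSG00000237094:I9",
    "chr1 326096 326569 473 + ENSG00000250575:E1",
    "chr1 327552 328453 901 + ENSG00000237094:E10",
    "chr1 324439 326514 2075 + ENSG00000237094:E9",
    "chr1 326570 327347 777 + ENSG00000250575:I1",
]

# Pass 1: order by number, then by letter (E sorts before I).
by_number = sorted(rows, key=lambda r: (split_key(r)[2], split_key(r)[1]))
# Pass 2: order by gene ID; stability keeps the pass-1 order within each ID.
two_pass = sorted(by_number, key=lambda r: split_key(r)[0])

for row in two_pass:
    print(row)
```

On the shell side, GNU sort has a -s (--stable) flag, and a single sort call can take several -k keys, so either of those may be enough to rescue the pipeline. I have not tested that against the full file.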

So I came back to this problem and wrote some Python that actually gets the job done:

#!/usr/bin/env python3

import re
import sys
from collections import defaultdict

# loop through the file names given on the command line
for thisarg in sys.argv[1:]:
    # map each ENSG code to the list of lines that carry it
    bysign = defaultdict(list)

    # read the file
    try:
        with open(thisarg) as thisfile:
            for line in thisfile:
                line = line.strip()
                if not line:
                    continue
                # split each line on whitespace and colons
                dat = re.split(r'[\s:]+', line)
                # append the line to the list for its ENSG code
                bysign[dat[-2]].append(line)
    except IOError:
        print("no such file {}".format(thisarg))
        continue

    # walk the ENSG codes in sorted order
    for key in sorted(bysign):
        # smaller dictionary for the lines sharing this ENSG code
        bytuple = dict()
        for line in bysign[key]:
            # extract the E/I code, e.g. "E10"
            ei = line.split(':')[-1]
            # convert it to an (int, char) tuple so that E1 < I1 < E2
            letter = ei[0]
            number = int(ei[1:])
            # use that tuple to index the smaller dict
            bytuple[(number, letter)] = line
        # print the lines in sorted tuple order
        for k in sorted(bytuple):
            print(bytuple[k])

I hope you get the idea by now. Curious whether anyone cares enough to improve my Python.
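
One possible tightening, as a sketch rather than a definitive version: a single key function can replace the two dictionaries. The names sort_key and sort_lines are made up here, and this assumes the last whitespace-separated column always looks like ID:E<number> or ID:I<number>.

```python
import sys

def sort_key(line):
    """Key that orders rows by gene ID, then number, then E before I."""
    gene_id, ei = line.split()[-1].split(":")
    return gene_id, int(ei[1:]), ei[0]

def sort_lines(lines):
    """Return the non-empty lines, stripped of newlines, sorted by sort_key."""
    cleaned = (line.rstrip("\n") for line in lines if line.strip())
    return sorted(cleaned, key=sort_key)

if __name__ == "__main__":
    # Treat every command-line argument as a file to sort and print.
    for path in sys.argv[1:]:
        with open(path) as handle:
            for line in sort_lines(handle):
                print(line)
```

Because nothing is stored in a dictionary keyed by the E/I code, two rows that happen to share the same ID and code are both kept instead of one overwriting the other.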