Sorting on a column alphanumerically
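I have the following file, and I would like to sort it alphanumerically on the 6th column so that, for a given ID before the ":", E1 is followed by I1, then E2, and so on. I tried sort -V -k6 file, but it places all the ID:I entries at the end instead of where they should be. However, when I run sort -k6, it does keep the Es and Is of an ID together, but rows whose IDs belong to a different family end up interspersed (I have highlighted them here). How can I sort it so that two IDs are never mixed and the column is in the order it should be? Here is the file: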
chr1 259017 259121 104 - ENSG00000228463:E2
chr1 259122 267095 7973 - ENSG00000228463:I1
chr1 267096 267253 157 - ENSG00000228463:E1
chr1 317720 317781 61 + ENSG00000237094:E1
chr1 317782 320161 2379 + ENSG00000237094:I1
chr1 320162 320653 491 + ENSG00000237094:E2
chr1 320654 320880 226 + ENSG00000237094:I2
chr1 320881 320938 57 + ENSG00000237094:E3
chr1 320939 321031 92 + ENSG00000237094:I3
chr1 321032 321290 258 + ENSG00000237094:E4
chr1 321291 322037 746 + ENSG00000237094:I4
chr1 322038 322228 190 + ENSG00000237094:E5
chr1 322229 322671 442 + ENSG00000237094:I5
chr1 322672 323073 401 + ENSG00000237094:E6
chr1 323074 323860 786 + ENSG00000237094:I6
chr1 323861 324060 199 + ENSG00000237094:E7
chr1 324061 324287 226 + ENSG00000237094:I7
chr1 324288 324345 57 + ENSG00000237094:E8
chr1 324346 324438 92 + ENSG00000237094:I8
chr1 324439 326514 2075 + ENSG00000237094:E9
**chr1 326096 326569 473 + ENSG00000250575:E1**
chr1 326515 327551 1036 + ENSG00000237094:I9
**chr1 326570 327347 777 + ENSG00000250575:I1**
**chr1 327348 328112 764 + ENSG00000250575:E2**
chr1 327552 328453 901 + ENSG00000237094:E10
chr1 328454 329783 1329 + ENSG00000237094:I10
**chr1 329431 329620 189 - ENSG00000233653:E2**
**chr1 329621 329949 328 - ENSG00000233653:I1**
chr1 329784 329976 192 + ENSG00000237094:E11
Original answer:
sed 's/:[EI]/&_ /' foo.txt | #separate the number at the end with a space
sort -k6 | sort -n -k7 | #sort by code, then by [EI] number
sed 's/_ //' #remove the underscore space
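I like to do this kind of thing by 'protecting' strings with placeholders, to isolate the part I'm interested in, and then substituting them back afterwards.
Closer: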
sed 's/:[EI]/_ &_ /' foo.txt | sort -n -k8 | sort -k6,6 | sed 's/_ //g'
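But this naively assumes that sort works in a very specific way, which it does not: the second sort re-sorts every line and is not required to preserve the order produced by the first, so sometimes E2 ends up before E1...
I'm not sure this can be done with sort alone; awk might be the way to go...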
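To make the point about the two piped sorts concrete, here is a minimal Python sketch (not part of the original answer) of the same multi-pass idea, done with a sort that is guaranteed to be stable. Python's `list.sort` is stable, so sorting by the least significant key first and the most significant key last should give the intended order. The `split_label` helper and the sample rows are hypothetical and only for illustration; the sketch assumes the last column always looks like `ID:E<n>` or `ID:I<n>`.

```python
def split_label(row):
    """Split the last column, e.g. 'ENSG00000237094:E10', into (id, letter, number)."""
    gene_id, ei = row.split()[-1].split(":")
    return gene_id, ei[0], int(ei[1:])

# A few rows taken from the question, deliberately out of order.
rows = [
    "chr1 327552 328453 901 + ENSG00000237094:E10",
    "chr1 326096 326569 473 + ENSG00000250575:E1",
    "chr1 320162 320653 491 + ENSG00000237094:E2",
    "chr1 326515 327551 1036 + ENSG00000237094:I9",
    "chr1 324439 326514 2075 + ENSG00000237094:E9",
]

# Least significant key first: the E/I letter, then the number, then the ID.
# Each later pass keeps the relative order of rows that tie on its own key.
rows.sort(key=lambda r: split_label(r)[1])
rows.sort(key=lambda r: split_label(r)[2])
rows.sort(key=lambda r: split_label(r)[0])

for row in rows:
    print(row)
```

With a stable sort the earlier passes survive as tie-breakers, which is exactly the property the piped sort commands were relying on without having it.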
So I came back to this question and wrote some Python that actually gets the job done:
#!/usr/bin/env python3
import sys
import re
from collections import defaultdict

# loop through the file names given on the command line
for thisarg in sys.argv[1:]:
    # initialize a default dict that groups lines by ENSG code
    bysign = defaultdict(list)
    # read the file
    try:
        with open(thisarg, 'r') as thisfile:
            for line in thisfile:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                # split each line on runs of spaces and colons
                dat = re.split('[ :]+', line)
                # append the line to the dictionary, indexed by ENSG code
                bysign[dat[-2]].append(line)
    except IOError:
        print("no such file {}".format(thisarg))
        continue
    # walk through the ENSG codes in sorted order
    for key in sorted(bysign):
        # initialize another, smaller dictionary
        bytuple = dict()
        # loop through all the lines that have the same ENSG code
        for line in bysign[key]:
            # extract the E/I code, e.g. "E10"
            ei = line.split(':')[-1]
            # convert the E/I code to a (number, letter) tuple, so that
            # E1 < I1 < E2 < I2 and E10 sorts after E9
            letter = ei[0]
            number = int(ei[1:])
            # use that tuple to index the smaller dict
            bytuple[(number, letter)] = line
        # print the lines of this group in key order
        for k in sorted(bytuple):
            print(bytuple[k])
I hope you've figured it out by now. Curious whether anyone cares enough to improve my Python.
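As one possible answer to that request, here is a shorter sketch that does the whole job in a single pass with a tuple sort key. It is not the answer author's code: the names `LABEL`, `sort_key` and `sort_lines` are hypothetical, and it assumes every line ends in a label of the form `ID:E<n>` or `ID:I<n>`. The key is (ID, number, letter), so within one ID the order is E1, I1, E2, I2, and E10 comes after E9.

```python
import re

# Matches a trailing label such as 'ENSG00000237094:E10'.
LABEL = re.compile(r"(\S+):([EI])(\d+)\s*$")

def sort_key(line):
    """Return (id, number, letter) so that E1 < I1 < E2 < I2 within one ID."""
    match = LABEL.search(line)
    if match is None:
        raise ValueError("unexpected last column: {!r}".format(line))
    gene_id, letter, number = match.groups()
    return gene_id, int(number), letter

def sort_lines(lines):
    """Sort the non-empty lines by ID, then E/I number, then E/I letter."""
    return sorted((ln.rstrip("\n") for ln in lines if ln.strip()), key=sort_key)

# A few rows taken from the question, deliberately out of order.
sample = [
    "chr1 327552 328453 901 + ENSG00000237094:E10",
    "chr1 259017 259121 104 - ENSG00000228463:E2",
    "chr1 329621 329949 328 - ENSG00000233653:I1",
    "chr1 326515 327551 1036 + ENSG00000237094:I9",
    "chr1 259122 267095 7973 - ENSG00000228463:I1",
    "chr1 329431 329620 189 - ENSG00000233653:E2",
    "chr1 324439 326514 2075 + ENSG00000237094:E9",
    "chr1 267096 267253 157 - ENSG00000228463:E1",
]

for line in sort_lines(sample):
    print(line)
```

Because `sort_lines` accepts any iterable of lines, it can also be given an open file object instead of the sample list.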