Compare three files using two columns and print unique entries in each file using awk/sed
I have three files in the following format:
$ cat a.bed
chr1 6 6 aa
chr1 8 8 bb
chr2 22 22 aa
chr3 24 24 bb
$ cat b.bed
chr1 12 12 cc
chr1 6 6 dd
chr5 14 14 cc
$ cat c.bed
chr1 8 8 ss
chr4 11 11 dd
chr1 6 6 aa
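I want to compare these files using the first two columns and print the information for each line, whether it is present in one file or in several, for example: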
chr1 6 6 aa 3 a.bed,b.bed,c.bed
chr1 8 8 bb 2 a.bed,c.bed
chr2 22 22 aa 1 a.bed
chr3 24 24 bb 1 a.bed
chr1 12 12 cc 1 b.bed
chr5 14 14 cc 1 b.bed
chr4 11 11 dd 1 c.bed
where column 5 gives the number of files the entry appears in and column 6 lists the file names.
Try these four lines of gawk (they did not seem to work in awk):
gawk '{print $0, FILENAME}' a.bed > abc.bed
gawk '{print $0, FILENAME}' b.bed >> abc.bed
gawk '{print $0, FILENAME}' c.bed >> abc.bed
gawk '{f = $5; k = $1 " " $2 " " $3 " " $4; if(k in a){a[k] = a[k] "," f}else{a[k] = f};c[k]++};END{for(k in a){print k, c[k], a[k]}}' abc.bed
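Single-character variables, for brevity:
f - the file name,
k - the key, i.e. the data,
a - array of keys (each entry holds the comma-separated file names for that key),
c - array of key counts.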
Er, if I'm reading this correctly, your input and output samples don't match: for example, there are only 2 'chr1 6 6 aa' lines, not 3.
awk to the rescue!
$ awk '{a[$1,$2]=(($1,$2) in a?a[$1,$2]",":$0 OFS)FILENAME}
END{for(k in a) print a[k]}' {a,b,c}.bed
The results will not come out in the same order, though.
Explanation
x=c?a:b is the ternary operator; it sets x to a or b depending on the value of c (similar to if-then-else). Here we assign the map value for key ($1,$2) either by appending "," and FILENAME to the existing value (if the key already exists) or by setting it to the current line (again with FILENAME appended). The END block just iterates over this map and prints the values.
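Neither command above prints the file count in column 5 while also keeping the order from the question. Here is a minimal sketch that combines the two ideas. It is not taken from either answer. It assumes each (column 1, column 2) pair appears at most once per file, and that the first line seen for a key is the one to print. The array names first, files, count and order are illustrative, and the scratch directory is only there so the snippet runs on its own.

```shell
#!/bin/sh
# Recreate the sample files from the question in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 'chr1 6 6 aa' 'chr1 8 8 bb' 'chr2 22 22 aa' 'chr3 24 24 bb' > "$dir/a.bed"
printf '%s\n' 'chr1 12 12 cc' 'chr1 6 6 dd' 'chr5 14 14 cc' > "$dir/b.bed"
printf '%s\n' 'chr1 8 8 ss' 'chr4 11 11 dd' 'chr1 6 6 aa' > "$dir/c.bed"

# Run awk from inside the directory so FILENAME is the bare file name.
result=$(cd "$dir" && awk '
{
    key = $1 SUBSEP $2                 # compare on the first two columns only
    if (!(key in first)) {
        first[key] = $0                # remember the first line seen for this key
        order[++n] = key               # remember first-seen order for the END block
        files[key] = FILENAME
    } else {
        files[key] = files[key] "," FILENAME
    }
    count[key]++                       # one increment per line, i.e. per file (assumed)
}
END {
    for (i = 1; i <= n; i++) {
        k = order[i]
        print first[k], count[k], files[k]
    }
}' a.bed b.bed c.bed)

printf '%s\n' "$result"
rm -rf "$dir"
```

With the sample files this prints the seven lines from the question in the requested order, starting with chr1 6 6 aa 3 a.bed,b.bed,c.bed. Keeping a separate order array avoids the unspecified iteration order of for (k in a). If a key can repeat within one file, count and the file list would need de-duplicating.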