BASH comm command, but for multiple columns

Question

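I am looking for something similar to the bash command comm that I can use to select the entries that are unique to each of my 2 files, as well as the entries common to both. comm worked great when I had just one column per file, e.g.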

 comm -23 FILE1.txt FILE2.txt > Entries_only_in_file1.txt

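But now I have multiple columns of information that I want to keep. I want to use column 2 as the key for finding the entries that are unique to each file and the entries common to both. If the entry in column 2 appears in both files, I would also like to record the information in columns 3, 4 and 5 from both files (if possible; this is not that important). Here is an example of input and output.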

FILE1.txt
NM_023928   AACS    2   2   1
NM_182662   AADAT   2   2   1
NM_153698   AAED1   1   5   3
NM_001271   AAGAB   2   2   1


FILE2.txt
NM_153698   AAED1   2   5   3
NM_001271   AAGAB   2   2   1
NM_001605   AARS    3   40  37
NM_212533   ABCA2   3   4   2

Desired output:

COMMON.txt
NM_153698   AAED1   1   5   3   2   5   3
NM_001271   AAGAB   2   2   1   2   2   1

UNIQUE_TO_1.txt
NM_023928   AACS    2   2   1
NM_182662   AADAT   2   2   1

UNIQUE_TO_2.txt
NM_001605   AARS    3   40  37
NM_212533   ABCA2   3   4   2

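I know similar questions have been asked before, but I couldn't find quite what I'm looking for. Any ideas are greatly appreciated, thank you.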

Answer 1

You can do this with GNU awk. Here is a script:

script.awk

# Print the first five fields of a line to the given file.
function unique(filename, line) {
    split(line, tmp, FS)
    print tmp[1], tmp[2], tmp[3], tmp[4], tmp[5] >> filename
}

NR == FNR { # reading the first file: store the line under its column-2 key
    file1[$2] = $0
    next
}

{   # reading the second file
    if ($2 in file1) { # key from file2 is also in file1
        split(file1[$2], tmp, FS)
        print $1, $2, tmp[3], tmp[4], tmp[5], $3, $4, $5 >> "COMMON.txt"
        # remove the common key so that only keys unique to file1 remain
        delete file1[$2]
    }
    else { # key unique to file2
        unique("UNIQUE_TO_2.txt", $0)
    }
}

END {
    # the remaining keys are unique to file1 (for-in order is unspecified)
    for (k in file1) {
        unique("UNIQUE_TO_1.txt", file1[k])
    }
}

Use it like this:

# erase the output files if present
rm -f COMMON.txt UNIQUE_TO_1.txt UNIQUE_TO_2.txt
# run script, create the file
awk -f script.awk FILE1.txt FILE2.txt
# output the files
for f in COMMON.txt UNIQUE_TO_1.txt UNIQUE_TO_2.txt; do echo "$f"; cat "$f"; done

print ... >> filename appends text to the file, so the output files have to be removed with rm before the script is run a second time.
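
As a possible variation on the point about appending: inside a single awk run, print > "file" truncates the file only when it is first opened and then keeps writing to it, so a version built on > should not need the rm step. The sketch below assumes that behaviour; the wrapper name split_on_col2 is made up for illustration, and the BEGIN block is my own addition so that an output with no matches ends up empty rather than stale.

```shell
# Hypothetical wrapper name; the awk logic mirrors script.awk above.
split_on_col2() {
    awk '
    BEGIN { # create or truncate all three outputs up front
        printf "" > "COMMON.txt"
        printf "" > "UNIQUE_TO_1.txt"
        printf "" > "UNIQUE_TO_2.txt"
    }
    NR == FNR { file1[$2] = $0; next }   # first file: store line under its column-2 key
    ($2 in file1) {                      # key present in both files
        split(file1[$2], a)
        print $1, $2, a[3], a[4], a[5], $3, $4, $5 > "COMMON.txt"
        delete file1[$2]
        next
    }
    { print > "UNIQUE_TO_2.txt" }        # key only in the second file
    END {                                # leftover keys are only in the first file
        for (k in file1) print file1[k] > "UNIQUE_TO_1.txt"
    }
    ' "$1" "$2"
}

# Sample data from the question
cat > FILE1.txt <<'EOF'
NM_023928   AACS    2   2   1
NM_182662   AADAT   2   2   1
NM_153698   AAED1   1   5   3
NM_001271   AAGAB   2   2   1
EOF
cat > FILE2.txt <<'EOF'
NM_153698   AAED1   2   5   3
NM_001271   AAGAB   2   2   1
NM_001605   AARS    3   40  37
NM_212533   ABCA2   3   4   2
EOF

# Running it twice should leave the same contents, with no rm in between
split_on_col2 FILE1.txt FILE2.txt
split_on_col2 FILE1.txt FILE2.txt
```

In this sketch the two UNIQUE files keep the original spacing of the input lines, and the order of lines in UNIQUE_TO_1.txt is not guaranteed because awk's for-in loop has no defined order.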

Answer 2

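join has the following options that are useful for your task:

-j FIELD: join on field FIELD of both files
-o FORMAT: specify the output format as a comma-separated list of FILENUM.FIELD entries
-v FILENUM: print only the unpairable lines from file FILENUM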
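
One thing to keep in mind with these options: join expects both inputs to be sorted on the join field. The sample files in the question already are, but if yours are not, a rough sketch like the following might help. The file names here are made up for illustration, and -1 2 -2 2 is used as the spelled-out equivalent of -j2.

```shell
# Made-up unsorted inputs (not the files from the question)
printf '%s\n' 'NM_001271 AAGAB 2 2 1' 'NM_023928 AACS 2 2 1'  > A_unsorted.txt
printf '%s\n' 'NM_212533 ABCA2 3 4 2' 'NM_001271 AAGAB 2 2 1' > B_unsorted.txt

# Sort both files on column 2, using the same locale for sort and join
LC_ALL=C sort -k2,2 A_unsorted.txt > A_sorted.txt
LC_ALL=C sort -k2,2 B_unsorted.txt > B_sorted.txt

# Join on column 2 of both files; the join field is printed first by default
LC_ALL=C join -1 2 -2 2 A_sorted.txt B_sorted.txt > joined.txt
```

Forcing LC_ALL=C on both commands is just one way to make sure sort and join agree on the ordering.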

Common to both files:

$ join -j2 -o 1.1,1.2,1.3,1.4,1.5,2.3,2.4,2.5 FILE1.txt FILE2.txt 
NM_153698 AAED1 1 5 3 2 5 3
NM_001271 AAGAB 2 2 1 2 2 1

Unique to FILE1:

$ join -j2 -v1 FILE1.txt FILE2.txt 
AACS NM_023928 2 2 1
AADAT NM_182662 2 2 1

Unique to FILE2:

$ join -j2 -v2 FILE1.txt FILE2.txt 
AARS NM_001605 3 40 37
ABCA2 NM_212533 3 4 2
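
Note that the -v output above puts the join field first, so the first two columns are swapped compared to the desired output. If that matters, combining -v with -o should restore the original column order. The following is a sketch under that assumption, writing straight to the three file names from the question; -1 2 -2 2 is the spelled-out equivalent of -j2.

```shell
# Sample data from the question (already sorted on column 2)
cat > FILE1.txt <<'EOF'
NM_023928   AACS    2   2   1
NM_182662   AADAT   2   2   1
NM_153698   AAED1   1   5   3
NM_001271   AAGAB   2   2   1
EOF
cat > FILE2.txt <<'EOF'
NM_153698   AAED1   2   5   3
NM_001271   AAGAB   2   2   1
NM_001605   AARS    3   40  37
NM_212533   ABCA2   3   4   2
EOF

# Entries whose column 2 appears in both files, with columns 3-5 of each
LC_ALL=C join -1 2 -2 2 -o 1.1,1.2,1.3,1.4,1.5,2.3,2.4,2.5 FILE1.txt FILE2.txt > COMMON.txt

# Entries only in FILE1, in their original column order
LC_ALL=C join -1 2 -2 2 -v 1 -o 1.1,1.2,1.3,1.4,1.5 FILE1.txt FILE2.txt > UNIQUE_TO_1.txt

# Entries only in FILE2, in their original column order
LC_ALL=C join -1 2 -2 2 -v 2 -o 2.1,2.2,2.3,2.4,2.5 FILE1.txt FILE2.txt > UNIQUE_TO_2.txt
```

Since each command redirects with >, the output files are overwritten on every run.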
