How to get the output from the comm command into 3 separate files?

Question

The question Unix command to find lines common in two files has an answer which suggests using the comm command to do the task:

comm -12 1.sorted.txt 2.sorted.txt

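This shows the lines common to both files (the -1 suppresses the lines found only in the first file, and the -2 suppresses the lines found only in the second file, leaving just the lines common to both files as the output). As the file names suggest, the input files must be in sorted order.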
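For comparison, the suppression options can be combined in pairs to select each column on its own. This is a minimal sketch, assuming two small hypothetical files a.txt and b.txt (not part of the original question), and it runs comm three times rather than splitting the output of a single run:

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical sample data, already in sorted order.
printf '%s\n' apple banana cherry > a.txt
printf '%s\n' banana cherry damson > b.txt

comm -23 a.txt b.txt > only-a.txt   # suppress columns 2 and 3: lines only in a.txt
comm -13 a.txt b.txt > only-b.txt   # suppress columns 1 and 3: lines only in b.txt
comm -12 a.txt b.txt > both.txt     # suppress columns 1 and 2: lines in both files
```

Each run reads both inputs again, which is why splitting the output of a single comm run can be attractive for large files.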

In a comment to that question, bapors asked:

How would one have the outputs in different files?

Seeking clarification, I asked:

If you want the lines only in File1 in one file, those only in File2 in another, and those in both in a third, then (provided that none of the lines in the files starts with a tab) you could use sed to split the output to three files.

To which the user confirmed:

It is exactly what I was asking. Would you show an example?

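The answer is relatively verbose and would spoil the simplicity of the answer to the other question (by drowning it in a lot of information), so I have asked the question separately here, and provided an answer.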

Answer 1

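The basic solution using sed relies on the fact that comm outputs lines found only in the first file with no prefix; it outputs lines found only in the second file with a single tab; and it outputs lines found in both files with two tabs.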

It also relies on sed's w command to write to files.
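As a quick illustration of the w command on its own, here is a minimal sketch; the input lines and the file name comments.txt are made up for the example:

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Lines matching /^#/ are written to comments.txt (a hypothetical name);
# -n suppresses sed's normal output, so nothing reaches the terminal.
printf '%s\n' '# note' 'code' '# todo' | sed -n '/^#/w comments.txt'

cat comments.txt
```

The file is created as soon as sed parses the script, and every line the address selects is appended to it while sed runs.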

Given that file 1.sorted.txt contains:

1.line-1
1.line-2
1.line-4
1.line-6
2.line-2
3.line-5

and file 2.sorted.txt contains:

1.line-3
2.line-1
2.line-2
2.line-4
2.line-6
3.line-5

the basic output from comm 1.sorted.txt 2.sorted.txt is:

1.line-1
1.line-2
        1.line-3
1.line-4
1.line-6
        2.line-1
                2.line-2
        2.line-4
        2.line-6
                3.line-5
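The tabs are invisible in the listing above, so it can help to make them explicit before writing the sed script. This is a small sketch that recreates the two sample files in a scratch directory (the directory and the name visible.txt are illustrative) and uses sed's l command, which prints each tab as \t and marks each line end with $:

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
export LC_ALL=C   # byte-order collation, so comm agrees that the data is sorted

printf '%s\n' 1.line-1 1.line-2 1.line-4 1.line-6 2.line-2 3.line-5 > 1.sorted.txt
printf '%s\n' 1.line-3 2.line-1 2.line-2 2.line-4 2.line-6 3.line-5 > 2.sorted.txt

# 'l' prints the pattern space unambiguously; -n stops sed printing it a second time.
comm 1.sorted.txt 2.sorted.txt | sed -n l > visible.txt
cat visible.txt
```

Lines found only in the second file show up as \t1.line-3$ and lines common to both as \t\t2.line-2$, which are the prefixes the sed script matches.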

Given a file script.sed containing:

/^\t\t/ {
    s///
    w file.3
    d
}
/^\t/ {
    s///
    w file.2
    d
}
/^[^\t]/ {
    w file.1
    d
}

you can run the command shown below and get the desired output like this:

$ comm 1.sorted.txt 2.sorted.txt | sed -f script.sed
$ cat file.1
1.line-1
1.line-2
1.line-4
1.line-6
$ cat file.2
1.line-3
2.line-1
2.line-4
2.line-6
$ cat file.3
2.line-2
3.line-5
$

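The script works by:

1. matching lines that start with 2 tabs, deleting the tabs, writing the line to file.3, and deleting the line (so that the rest of the script is ignored),
2. matching lines that start with 1 tab, deleting the tab, writing the line to file.2, and deleting the line (so that the rest of the script is ignored),
3. matching lines that do not start with a tab, writing the line to file.1, and deleting the line.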

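The match and delete operations in step 3 are more for symmetry than anything else; they could be omitted (leaving just w file.1) and this script would work the same. However, see script3.sed below for further justification for keeping the symmetry.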

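As written, this requires GNU sed; BSD sed does not recognize the \t escapes. Obviously, the file could be written with actual tab characters in place of the \t notation, and then BSD sed is OK with the script.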
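One way to get real tab characters into the script without typing them is to let printf expand \t while generating the file. This is a sketch under that assumption, recreating the sample files in a scratch directory; the generated script.sed has no \t escapes left for sed to interpret, so it should not depend on GNU sed:

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
export LC_ALL=C   # byte-order collation, so comm agrees that the data is sorted

printf '%s\n' 1.line-1 1.line-2 1.line-4 1.line-6 2.line-2 3.line-5 > 1.sorted.txt
printf '%s\n' 1.line-3 2.line-1 2.line-2 2.line-4 2.line-6 3.line-5 > 2.sorted.txt

# printf turns each \t in its format string into a literal tab (and \n into a
# newline), so the file that sed reads contains actual tab characters.
printf '/^\t\t/ {\n    s///\n    w file.3\n    d\n}\n'  > script.sed
printf '/^\t/ {\n    s///\n    w file.2\n    d\n}\n'   >> script.sed
printf '/^[^\t]/ {\n    w file.1\n    d\n}\n'          >> script.sed

comm 1.sorted.txt 2.sorted.txt | sed -f script.sed
```

The three output files should match the file.1, file.2 and file.3 listings shown earlier.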

It is possible to make it all work on the command line, but it is fiddly (and that is being polite about it). Using Bash's ANSI C Quoting, you can write:

$ comm 1.sorted.txt 2.sorted.txt |
> sed -e $'/^\t\t/  { s///\n w file.3\n d\n }' \
>     -e $'/^\t/    { s///\n w file.2\n d\n }' \
>     -e $'/^[^\t]/ {        w file.1\n d\n }'
$

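This writes each of the three 'paragraphs' of script.sed in a separate -e option. The w command is fussy; it expects the file name, and only the file name, after it on the same line of the script, hence the use of \n after the file names in the script. There are plenty of spaces that could be eliminated, but the symmetry of the layout shown is clearer. Using the -f script.sed file is probably simpler; it is certainly a technique worth knowing, as it avoids problems when the sed script must operate on single quotes, double quotes and backquotes, which make it hard to write the script on the Bash command line.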

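Finally, if the two files can contain lines that start with tabs, this technique needs some more brute force to make it work. One variant solution exploits Bash's process substitution to add a prefix before the lines in the files, and then the post-processing sed script removes the prefixes before writing to the output files.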

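script3.sed (the tabs are shown here as runs of up to 8 spaces, so the real file must contain tab characters in their place). Note that this time the substitute s/// is needed in the third paragraph (the d is still optional, but may as well be included):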

/^              X/ {
    s///
    w file.3
    d
}
/^      X/ {
    s///
    w file.2
    d
}
/^X/ {
    s///
    w file.1
    d
}

and the command line:

$ comm <(sed 's/^/X/' 1.sorted.txt) <(sed 's/^/X/' 2.sorted.txt) |
> sed -f script3.sed
$

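For the same input files, this produces the same output, but by adding and then removing the X at the start of each line, the code does not change the sort order of the data and handles leading tabs if they are present.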
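To see the prefix trick cope with leading tabs, here is a self-contained sketch. The files a.txt and b.txt and their contents are invented for the example. It also uses temporary prefixed files instead of process substitution, purely so that it runs in a plain POSIX shell, and it generates script3.sed with printf so that the script holds real tabs:

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
export LC_ALL=C   # byte-order collation, so comm agrees that the data is sorted

# Hypothetical sorted inputs; some lines start with a tab.
printf '\tindented-a\n\tindented-both\nplain-a\nplain-both\n' > a.txt
printf '\tindented-b\n\tindented-both\nplain-b\nplain-both\n' > b.txt

# Same three paragraphs as script3.sed, written with literal tabs.
printf '/^\t\tX/ {\n    s///\n    w file.3\n    d\n}\n'  > script3.sed
printf '/^\tX/ {\n    s///\n    w file.2\n    d\n}\n'   >> script3.sed
printf '/^X/ {\n    s///\n    w file.1\n    d\n}\n'     >> script3.sed

# Prefix every line with X, compare, then strip the prefix while splitting.
sed 's/^/X/' a.txt > a.prefixed
sed 's/^/X/' b.txt > b.prefixed
comm a.prefixed b.prefixed | sed -f script3.sed
```

With the X in place, only the tabs that comm itself adds can precede the X, so the tab that belongs to a line such as the indented ones survives into the output files.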

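You can also easily write solutions that use Perl or Awk, and these need not even use comm (and can be made to work with unsorted files, provided that the files fit into memory).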
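As one possible shape for such an Awk solution, here is a sketch that does not use comm at all. It is not the author's code: the input names first.txt and second.txt are hypothetical, the second file is deliberately read twice, and it assumes that neither file is empty (the file counter relies on FNR == 1):

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical unsorted inputs.
printf '%s\n' pear apple fig > first.txt
printf '%s\n' fig kiwi apple > second.txt

awk '
    FNR == 1 { fileno++ }                     # count which argument is being read
    fileno == 1 { in_second[$0] = 1; next }   # pass 1: remember the lines of second.txt
    fileno == 2 {                             # pass 2: classify the lines of first.txt
        in_first[$0] = 1
        print > ($0 in in_second ? "file.3" : "file.1")
        next
    }
    !($0 in in_first) { print > "file.2" }    # pass 3: lines found only in second.txt
' second.txt first.txt second.txt
```

The lines keep the order they had in the input files, and both sets of lines are held in memory, which is the limitation mentioned above.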

Answer 2

comm + awk solution:

Complicated sample files:

1.txt:

1. line-1 with spaces (                 |   | here
1.line-2
1.line-4    with tabs > 
 1.line-6
2.line-2
        3.line-5 (tabs)

2.txt:

1.line-3
  2.line-1 with spaces
2.line-2
2.line-4
    2.line-6 with tabs
        3.line-5 (tabs)

The job:

comm -12 1.txt 2.txt > file-common
awk 'NR==FNR{ a[$0];next }!($0 in a){ print $0 > "file"ARGIND-1 }' file-common 1.txt 2.txt

comm -12 1.txt 2.txt > file-common - saves the common lines to the file file-common
awk ... - prints the lines unique to 1.txt and 2.txt into the files file1 and file2 respectively (ARGIND is a GNU awk extension, so this needs gawk)

Viewing the results:

head file*
==> file1 <==
1. line-1 with spaces (                 |   | here
1.line-2
1.line-4    with tabs > 
 1.line-6

==> file2 <==
1.line-3
  2.line-1 with spaces
2.line-4
    2.line-6 with tabs

==> file-common <==
2.line-2
        3.line-5 (tabs)

unix

sed

comm