Get common lines, for only specific fields, from multiple files

Question

I am trying to understand the following code, which is used to pull out overlapping lines across multiple files using BASH.

awk 'END {
  # the END block is executed after
  # all the input has been read
  # loop over the rec array
  # and build the dup array indexed by the number of
  # filenames containing a given record
  for (R in rec) {
    n = split(rec[R], t, "/")
    if (n > 1) 
      dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s", rec[R], R) : \
        sprintf("\t%-20s -->\t%s", rec[R], R)
    }
  # loop over the dup array
  # and report the number and the names of the files 
  # containing the record   
  for (D in dup) {
    printf "records found in %d files:\n\n", D
    printf "%s\n\n", dup[D]
    }  
  }
{  
  # build an array named rec (short for record), indexed by 
  # the content of the current record ($0), concatenating 
  # the filenames separated by / as values
  rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
  }' file[a-d]

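Having understood what each sub-block of the code does, I would like to extend it to find overlaps on specific fields rather than on the entire line. For example, I tried changing the line: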

n = split(rec[R], t, "/")

to

n = split(rec[R$1], t, "/")

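to find the lines where the first field is the same across all files, but this did not work. Eventually I would like to extend this to check whether fields 1, 2 and 4 of a line are the same, and then print that line.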

Specifically, for the files mentioned in the example at the link: if file 1 is:

chr1    31237964    NP_055491.1    PUM1    M340L
chr1    33251518    NP_037543.1    AK2    H191D

and file 2 is:

chr1    116944164    NP_001533.2    IGSF3    R671W
chr1    33251518    NP_001616.1    AK2    H191D
chr1    57027345    NP_001004303.2    C1orf168    P270S

I want to pull out:

file1/file2 --> chr1    33251518    AK2    H191D

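I found this code at the following link: http://www.unix.com/shell-programming-and-scripting/140390-get-common-lines-multiple-files.html#post302437738. Specifically, I would like to understand what R, rec, n, dup and D represent in the script itself. This is not clear from the comments provided, and the printf statements I added inside the sub-loops failed.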

Thank you very much for any insight into this!

Answer 1

First you need to understand the 3 blocks in an AWK script:

BEGIN{
# Code that is executed once, before data processing starts
}

{
# block without a name (default/main block)
# executed per line of input
# $0 contains the whole line (all columns)
# $1 first column
# $2 second column, and so on..
}

END{
# Code that is executed once, after all data processing has finished
}
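
As a quick illustration of the order in which the three blocks run, here is a minimal sketch. The two input lines and the counter n are made up for the example and are not part of the original script.

```shell
# Feed two lines to awk and show when each block fires.
out=$(printf 'a 1\nb 2\n' | awk '
BEGIN { print "start" }          # runs once, before any input is read
      { n++; print $1 }          # runs once per input line
END   { print "lines: " n }      # runs once, after the last line
')
printf '%s\n' "$out"
```

It prints "start", then the first field of each line, then the line count collected by the main block.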

So you probably need to edit this part of the script:

  {  
  # build an array named rec (short for record), indexed by 
  # the content of the current record ($0), concatenating 
  # the filenames separated by / as values
  rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
  }

Answer 2

The script works by building an auxiliary array whose indices are the lines of the input files (denoted by $0 in rec[$0]) and whose values are filename1/filename3/... for those filenames in which the given line $0 is present. You can hack it up to work with just $1, $2 and $4, like so:

awk 'END {
  # the END block is executed after
  # all the input has been read
  # loop over the rec array
  # and build the dup array indexed by the number of
  # filenames containing a given record
  for (R in rec) {
    n = split(rec[R], t, "/")
    if (n > 1) {
      split(R, R1R2R4, SUBSEP)
      dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1], R1R2R4[2], R1R2R4[3]) : \
        sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1], R1R2R4[2], R1R2R4[3])
    }
  }
  # loop over the dup array
  # and report the number and the names of the files
  # containing the record
  for (D in dup) {
    printf "records found in %d files:\n\n", D
    printf "%s\n\n", dup[D]
  }
}
{
  # build an array named rec (short for record), indexed by
  # the partial content of the current record
  # (special concatenation of $1, $2 and $4)
  # concatenating the filenames separated by / as values
  rec[$1,$2,$4] = rec[$1,$2,$4] ? rec[$1,$2,$4] "/" FILENAME : FILENAME
}' file[a-d]

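To be precise, this solution uses multidimensional arrays: we create rec[$1,$2,$4] instead of rec[$0]. This special awk syntax concatenates the indices with the SUBSEP character, which is non-printable by default ("\034" to be precise), so it is unlikely to be part of any of the fields. In effect it does rec[$1 SUBSEP $2 SUBSEP $4]=.... Otherwise this part of the code is the same. Note that it would be more logical to move the second block to the beginning of the script and finish with the END block.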
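
The SUBSEP mechanics can be seen in isolation with a small sketch. The key values are taken from the question's sample data; the variable names key and part are made up for the illustration.

```shell
# Store one value under a three-part key, then recover the parts with split().
out=$(awk 'BEGIN {
  rec["chr1", "33251518", "AK2"] = "file1/file2"   # stored under one joined key
  for (key in rec) {
    n = split(key, part, SUBSEP)                   # cut the key at each SUBSEP
    print n, part[1], part[2], part[3], rec[key]
  }
}')
printf '%s\n' "$out"
```

The loop sees a single index, and split reports three pieces, which shows that the comma-separated subscripts are stored as one string joined by SUBSEP.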

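The first part of the code also has to be changed: now for (R in rec) loops over these tricky concatenated indices, $1 SUBSEP $2 SUBSEP $4. This is fine for indexing, but you need to split R at the SUBSEP characters to get back the printable fields $1, $2, $4. These are put into the array R1R2R4, which can be used to print the necessary output: instead of %s,...,R we now have %s\t%s\t%s,...,R1R2R4[1],R1R2R4[2],R1R2R4[3]. In effect we are doing sprintf ...%s,...,$1,$2,$4; with the pre-saved fields $1, $2, $4. For your input example this will print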

records found in 2 files:

    foo11.inp1/foo11.inp2 -->   chr1    33251518    AK2

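Note that H191D is missing from the output, and rightly so: it is not in field 1, 2 or 4 (it is in field 5), so there is no guarantee that it is the same in all the files that are printed! You probably don't want to print it, or in any case you have to specify how the columns that are not checked between files (and so may differ) should be treated.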

A bit of explanation for the original code:

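rec is an array whose indices are full input lines and whose values are slash-separated lists of the files in which those lines appear. For instance, if file1 contains the line "foo bar", then initially rec["foo bar"]=="file1". If file2 also contains this line, then rec["foo bar"]=="file1/file2". Note that there is no check for multiplicity, so if file1 contains this line twice, you end up with rec["foo bar"]=="file1/file1/file2" and a reported count of 3 files containing the line.
R goes over the indices of the array rec after it has been fully built. This means that R eventually takes on each unique line of every input file, allowing us to look up rec[R], which contains the names of the files in which the specific line R was present.
n is the return value of split, which splits the value of rec[R] (that is, the list of filenames corresponding to line R) at each slash. The array t is filled with the list of files, but we don't make use of it; we only use the length of t, i.e. the number of files in which line R is present (this is saved in the variable n). If n==1 we do nothing; we act only when there is multiplicity.
Grouping by n inside the loop creates classes according to the multiplicity of a given line. n==2 applies to lines that are present in exactly 2 files, n==3 to those that appear three times, and so on. What the loop does is build an array dup which, for every multiplicity class (i.e. for every n), holds the output strings "filename1/filename2/... --> R", one for each value of R that appears n times in total in the files, separated by RS (the record separator). So in the end, for a given n, dup[n] contains some number of strings of the form "filename1/filename2/... --> R", joined with the RS character (a newline by default).
The loop over D in dup then goes through the multiplicity classes (i.e. the valid values of n greater than 1) and prints the output lines collected in dup[D] for each D. Since we only defined dup[n] for n>1, the smallest possible D is 2 if there is any multiplicity (or, if there is none, dup is empty and the loop over D does nothing). Note that awk does not guarantee the order in which for (D in dup) visits the indices.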
