awk search content if it contains content in a list file

Question

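I am having trouble searching a huge csv file (call it file1) with awk. Fortunately, I have a list file (call it file2), and I can look up the lines I need based on the index entries listed in file2. However, file1 is not like an ordinary csv file; it looks like this: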

ID1, AC000112;AC000634;B0087;P01116;,ID1_name
ID2, AC000801;,ID2_name
ID3, P01723;F08734;,ID3_name
ID4, AC0014;AC0114;P01112;,ID4_name
...
IDn, AC0006;,IDn_name
IDm, Ac8007; P01167;,IDm_name

The index file, file2, looks like:

The desired output should be:

ID1, AC000112;AC000634;B0087;P01116;,ID1_name
ID2, AC000801;,ID2_name
ID4, AC0014;AC0114;P01112;,ID4_name
IDm, Ac8007; P01167;,IDm_name

If I use

awk -F, 'NR==FNR{a[$1]; next} ($2 in a)' file2 file1

I get nothing. If I add ";" to the end of each line in file2, I only get ID2, AC000801;,ID2_name. If I change the condition to $2 ~ a[$1], it still doesn't work.

So I would like to know how to change this command to get the desired result. Thanks!

Answer 1

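You can set the field separator to a comma surrounded by optional whitespace: [[:space:]]*,[[:space:]]*

Then split the second field of file1 on a semicolon surrounded by optional whitespace, [[:space:]]*;[[:space:]]*, and check whether any of the resulting parts exists in the array a.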

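awk -F"[[:space:]]*,[[:space:]]*" 'NR==FNR{
  a[$1]; next                # file2: remember each index entry as a key
}
{
  # file1: break the second field into its individual entries
  split($2, parts, /[[:space:]]*;[[:space:]]*/)
  for (i in parts) {
    if (parts[i] in a) {
      print $0; break;       # print the line once, on the first hit
    }
  }
}
' file2 file1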

Output

ID1, AC000112;AC000634;B0087;P01116;,ID1_name
ID2, AC000801;,ID2_name
ID4, AC0014;AC0114;P01112;,ID4_name
IDm, Ac8007; P01167;,IDm_name
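
The approach above can be tried end to end with a small script. This is only a sketch: the question does not show the contents of file2, so the four entries below are made-up values chosen to hit ID1, ID2, ID4 and IDm, and the temporary directory is just a convenience. The loop uses the count returned by split() instead of for (i in parts), which keeps the scan order fixed; the result is the same either way because the line is printed at most once.

```shell
#!/bin/sh
# Assumption: sample data only. The real file2 is not shown in the question.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
ID1, AC000112;AC000634;B0087;P01116;,ID1_name
ID2, AC000801;,ID2_name
ID3, P01723;F08734;,ID3_name
ID4, AC0014;AC0114;P01112;,ID4_name
IDm, Ac8007; P01167;,IDm_name
EOF
cat > "$tmp/file2" <<'EOF'
B0087
AC000801
AC0114
P01167
EOF

out=$(awk -F'[[:space:]]*,[[:space:]]*' '
NR == FNR { a[$1]; next }            # first file (file2): store keys
{
  n = split($2, parts, /[[:space:]]*;[[:space:]]*/)
  for (i = 1; i <= n; i++)           # scan entries left to right
    if (parts[i] in a) { print; break }
}' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

With these assumed inputs the script prints the ID1, ID2, ID4 and IDm lines and skips ID3, because none of the ID3 entries appear in the made-up file2.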

Answer 2

With your shown samples, please try the following awk code.

awk -F',|[[:space:]]+|;' '
FNR==NR{
  for(i=2;i<=NF;i++){
    arr[$i]=$0
  }
  next
}
($0 in arr){
  print arr[$0]
}
' file1 file2

Explanation: a detailed explanation of the above code.

awk -F',|[[:space:]]+|;' '  ##Setting field separator as comma, space(s), semi-colon here.
FNR==NR{                    ##This condition will be TRUE when file1 is being read.
  for(i=2;i<=NF;i++){       ##Using for loop to traverse from the 2nd field to the last field.
    arr[$i]=$0              ##Creating arr with index of current field, with value of current line.
  }
  next                      ##next will skip all further statements from here.
}
($0 in arr){                ##Checking condition if current line (of file2) is present in arr.
  print arr[$0]             ##Printing arr with index of $0 here.
}
' file1 file2               ##Mentioning Input_file names here.
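
Because this version reads file1 first and then walks file2, the output follows the order of file2 rather than the order of file1. The sketch below illustrates that with made-up data (the file2 entries are assumptions, since the question does not show them): P01167 is listed before B0087, so the IDm line comes out first.

```shell
#!/bin/sh
# Assumption: sample data only. The real file2 is not shown in the question.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
ID1, AC000112;AC000634;B0087;P01116;,ID1_name
ID3, P01723;F08734;,ID3_name
IDm, Ac8007; P01167;,IDm_name
EOF
# Hypothetical index list, deliberately in the opposite order to file1.
printf '%s\n' 'P01167' 'B0087' > "$tmp/file2"

out=$(awk -F',|[[:space:]]+|;' '
FNR == NR { for (i = 2; i <= NF; i++) arr[$i] = $0; next }  # file1: entry -> line
($0 in arr) { print arr[$0] }                               # file2: look up each entry
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Two consequences of this design are worth keeping in mind: a file1 line is printed once for every one of its entries that appears in file2, and if the same entry occurs on several file1 lines, only the last of those lines is stored in arr.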

Answer 3

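Assumptions:

the search strings contain only letters and digits

One GNU awk idea: wrap each search pattern in word-boundary markers and then do a regex comparison: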

awk -F',' '
FNR==NR { regs["\\<" $1 "\\>"]; next }
        { for (regex in regs)
              if ($2 ~ regex) { print; next }
        }
' file2 file1

This generates:

ID1, AC000112;AC000634;B0087;P01116;,ID1_name
ID2, AC000801;,ID2_name
ID4, AC0014;AC0114;P01112;,ID4_name
IDm, Ac8007; P01167;,IDm_name

Answer 4

If you are not restricted to awk, I would use grep for this task:

grep -Fwf file2 file1

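-f file2: use each line of file2 as a search string.
-w: match whole words only (so the pattern P01167 will not match P011670). Any character other than a letter, digit, or underscore delimits a word (so P01167;, will match).
-F: fixed strings. Match the strings literally, so that regex characters have no special meaning.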
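
The effect of -w can be checked with a two-line sample. This is a sketch with assumed data: the IDa line and the single-entry file2 are invented for illustration and do not come from the question.

```shell
#!/bin/sh
# Assumption: sample data only; IDa/P011670 is a made-up near miss.
tmp=$(mktemp -d)
printf '%s\n' 'IDa, P011670;,IDa_name' \
              'IDm, Ac8007; P01167;,IDm_name' > "$tmp/file1"
printf '%s\n' 'P01167' > "$tmp/file2"

loose=$(grep -Ff "$tmp/file2" "$tmp/file1")    # substring match: both lines
strict=$(grep -Fwf "$tmp/file2" "$tmp/file1")  # whole-word match: IDm only

printf 'without -w:\n%s\n' "$loose"
printf 'with -w:\n%s\n' "$strict"
rm -rf "$tmp"
```

Unlike the awk versions, grep looks at the whole line rather than only the second column, so an entry in file2 that happens to appear as a word in the ID or name column would also select that line.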