Finding location of duplicates from text

Question

My data is in the following format:

1;string1
2;string2
...
n;stringn

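The first column is an ID number and the second column is a text string. The text strings can contain digits, letters, and characters such as /.()?!. The ID number equals the line number. I am trying to find the duplicates among these text strings. I would like to get information like this: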

String of id 1 is duplicated on lines/ids 4,6,7
String of id 2 is duplicated on lines/ids 11,25

So far, I have done this with an Awk command:

awk '/String of text/ {print FNR}' targetfile

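and manually replaced the search string with each text string in my file. Now that the dataset is larger, this has become impractical. Can my Awk command be improved so that it automatically tests every text string in the file against all the others and outputs the information I am looking for? I thought of using a for-loop for this, but could not figure out how to make it work.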

If there is a better solution, I am also open to tools other than Awk. My system is Ubuntu 14.04.

Answer 1

Put this (explanation in the comments):

{ seen[$2] = seen[$2] NR " " }               # remember where you saw each string,
                                             # as a space-separated list of line numbers

END {                                        # in the end
  for(s in seen) {                           # for all strings you saw
    n = split(seen[s], nums, " ");           # split apart the line numbers again;
                                             # split returns how many there are
    if(n > 1) {                              # if you saw it more than once
      line = s " is duplicated on lines";    # build the output line
      for(i = 1; i <= n; ++i) {              # with all the line numbers where you
        line = line " " nums[i]              # saw it
      }
      print line                             # and print the line
    }
  }
}

into a file, say foo.awk, then run awk -F \; -f foo.awk filename

You can also put it on one line like this:

awk -F \; '{ seen[$2] = seen[$2] NR " " } END { for(s in seen) { n = split(seen[s], nums, " "); if(n > 1) { line = s " is duplicated on lines"; for(i = 1; i <= n; ++i) { line = line " " nums[i] } print line } } }' filename

...but it is long enough that I would use the file instead.


linux

bash

ubuntu

awk

text