uniq by only a part of the line

Question

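I'm trying to merge email lists, but I want to uniq (or uniq -i -u) by the email address rather than by the whole line, so that we don't end up with duplicates.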

List 1:

Company A <companya@companya.com>
Company B <companyb@companyb.com>
Company C <companyc@companyc.com>

List 2:

firstname lastname <firstname@gmail.com>
Fake Person <companyb@companyb.com>
Joe lastnanme <joe@gmail.com>

The current output is

Company A <companya@companya.com>
Company B <companyb@companyb.com>
Company C <companyc@companyc.com>
firstname lastname <firstname@gmail.com>
Fake Person <companyb@companyb.com>
Joe lastnanme <joe@gmail.com>

The desired output would be

Company A <companya@companya.com>
Company B <companyb@companyb.com>
Company C <companyc@companyc.com>
firstname lastname <firstname@gmail.com>
Joe lastnanme <joe@gmail.com>

(because companyb@companyb.com is listed in both)

How can I do this?

Answer 1

Here is one in awk:

$ awk '
match($0,/[a-z0-9.]+@[a-z.]+/) {      # look for emailish string *
    a[substr($0,RSTART,RLENGTH)]=$0   # and hash the record using the address as key
}
END {                                 # after all are processed
    for(i in a)                       # output them in no particular order
        print a[i]
}' file2 file1                        # switch order to see how it affects output

Output:

Company A <companya@companya.com>
Company B <companyb@companyb.com>
Company C <companyc@companyc.com>
Joe lastnanme <joe@gmail.com>
firstname lastname <firstname@gmail.com>

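The script looks for a very simple email-like string (* see the regex in the script and tune it to your liking) and uses it as the key for hashing the whole record. The last instance wins, because earlier instances are overwritten.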
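The question mentions uniq -i and a desired output that keeps the first occurrence in input order, while the script above is case-sensitive, keeps the last occurrence, and prints in arbitrary order. The following is a minimal sketch of a variant, assuming the same simple pattern is good enough. The function name dedupe_by_email, the array name seen, and the mixed-case address in the sample data are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample files mirroring the lists in the question;
# the address in list2 is written in mixed case on purpose.
tmp=$(mktemp -d)
printf '%s\n' 'Company A <companya@companya.com>' \
              'Company B <companyb@companyb.com>' \
              'Company C <companyc@companyc.com>' > "$tmp/list1"
printf '%s\n' 'firstname lastname <firstname@gmail.com>' \
              'Fake Person <CompanyB@companyb.com>' \
              'Joe lastnanme <joe@gmail.com>' > "$tmp/list2"

dedupe_by_email() {
    awk '
    {
        lower = tolower($0)                       # compare case-insensitively
        if (match(lower, /[a-z0-9.]+@[a-z.]+/)) { # same simple email-ish pattern
            key = substr(lower, RSTART, RLENGTH)
            if (!(key in seen)) {                 # first instance wins
                seen[key] = 1
                print                             # original line, in input order
            }
        }
    }' "$@"
}

result=$(dedupe_by_email "$tmp/list1" "$tmp/list2")
printf '%s\n' "$result"
rm -rf "$tmp"
```

As in the original script, lines without anything that looks like an address are dropped rather than passed through.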

Answer 2

Could you please try the following.

awk '
{
   match($0,/<.*>/)
   val=substr($0,RSTART,RLENGTH)
}
FNR==NR{
   a[val]=$0
   print
   next
}
!(val in a)
' list1 list2

Explanation: the same code with comments added.

awk '                                    ## Start of the awk program.
{                                        ## This block runs for every line of both input files.
   match($0,/<.*>/)                      ## Match everything from the first < to the last > on the line.
   val=substr($0,RSTART,RLENGTH)         ## Store the matched text (RLENGTH characters starting at RSTART) in val.
}                                        ## End of the block.
FNR==NR{                                 ## FNR==NR is true only while the first input file, list1, is being read.
   a[val]=$0                             ## Store the current line in array a under the key val.
   print $0                              ## Print the current line.
   next                                  ## Skip the remaining rules for this line.
}
!(val in a)                              ## For list2: print the line only if val is not already a key in a.
' list1 list2                            ## The input files.

The output is as follows.

Company A <companya@companya.com>
Company B <companyb@companyb.com>
Company C <companyc@companyc.com>
firstname lastname <firstname@gmail.com>
Joe lastnanme <joe@gmail.com>

Answer 3

uniq has a -f option to skip a number of blank-separated fields, so we can sort on the third field and then have uniq ignore the first two:

$ sort -k 3,3 infile | uniq -f 2
Company A <companya@companya.com>
Company B <companyb@companyb.com>
Company C <companyc@companyc.com>
firstname lastname <firstname@gmail.com>
Joe lastnanme <joe@gmail.com>

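However, this is not very robust: it breaks as soon as a line does not have exactly two fields before the email address, because sort will then sort on the wrong field and uniq will compare the wrong fields.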
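To make the fragility concrete, here is a small sketch with a hypothetical three-word name (Mary Ann Smith is invented for illustration). It compares the sort | uniq -f 2 pipeline with keying on the last field instead.

```shell
#!/bin/sh
# Two lines with the same address, but a different number of name words.
tmp=$(mktemp -d)
printf '%s\n' 'Company B <companyb@companyb.com>' \
              'Mary Ann Smith <companyb@companyb.com>' > "$tmp/infile"

# Field 3 of the second line is "Smith", not the address, so uniq -f 2
# compares different text on the two lines and keeps both.
fragile=$(sort -k 3,3 "$tmp/infile" | uniq -f 2)

# The last field ($NF) is the address no matter how many name words precede it.
robust=$(awk '!seen[$NF]++' "$tmp/infile")

printf 'sort | uniq -f 2:\n%s\n\n' "$fragile"
printf 'awk on $NF:\n%s\n' "$robust"
rm -rf "$tmp"
```

The pipeline prints both lines, while the awk version prints only the Company B line.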

(See karakfa's answer for why uniq is not even needed here.)

Alternatively, check only the last field for uniqueness:

awk '!e[$NF] {print; ++e[$NF]}' infile

(Or even shorter, stolen from karakfa: awk '!e[$NF]++' infile)

Answer 4

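Given your file format,

$ awk -F'[<>]' '!a[$2]++' files

prints only the first line for each distinct value inside the angle brackets. If nothing follows the address on the line, there is no need to split on the angle brackets: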

$ awk '!a[$NF]++' files

The same can be done with sort:

$ sort -t'<' -k2,2 -u files

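A side effect is that the output is sorted, which may or may not be what you want.

N.B. Both alternatives assume that angle brackets appear nowhere except around the email address.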
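The difference between the two awk keys only shows up when something follows the address. Here is a minimal sketch, where the trailing "(billing)" note is a hypothetical addition to the sample data:

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '%s\n' 'Company B <companyb@companyb.com>' \
              'Fake Person <companyb@companyb.com> (billing)' > "$tmp/files"

# Key: the text between < and >, identical on both lines.
by_bracket=$(awk -F'[<>]' '!a[$2]++' "$tmp/files")

# Key: the last whitespace-separated field, which is "(billing)" on line 2.
by_last=$(awk '!a[$NF]++' "$tmp/files")

printf 'split on brackets:\n%s\n\n' "$by_bracket"
printf 'last field:\n%s\n' "$by_last"
rm -rf "$tmp"
```

Splitting on the brackets keeps only the first line. The $NF version treats the two lines as distinct and keeps both.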

Answer 5

Maybe I didn't understand the question!
But you can try this awk:

awk 'NR!=FNR && $3 in a{next}{a[$3]}1' list1 list2
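For readers who find the one-liner too terse, here is a sketch of the same logic spread over several lines. It assumes the address is always the third whitespace-separated field, as in the sample lists; the array name seen and the temporary files are illustrative.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '%s\n' 'Company A <companya@companya.com>' \
              'Company B <companyb@companyb.com>' \
              'Company C <companyc@companyc.com>' > "$tmp/list1"
printf '%s\n' 'firstname lastname <firstname@gmail.com>' \
              'Fake Person <companyb@companyb.com>' \
              'Joe lastnanme <joe@gmail.com>' > "$tmp/list2"

merged=$(awk '
NR != FNR && ($3 in seen) { next }  # past the first file: skip known addresses
{ seen[$3] }                        # remember the address (referencing creates the key)
1                                   # a true pattern with no action prints the line
' "$tmp/list1" "$tmp/list2")

printf '%s\n' "$merged"
rm -rf "$tmp"
```

Note that every line of the first file is printed, so duplicates inside list1 itself are not removed.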
