uniq by only a part of the line
I'm trying to merge email lists, but I want to uniq (or uniq -i -u) by the email address rather than by the whole line, so that we don't end up with duplicates.
List 1:
Company A <companya@companya.com>
Company B <companyb@companyb.com>
Company C <companyc@companyc.com>
List 2:
firstname lastname <firstname@gmail.com>
Fake Person <companyb@companyb.com>
Joe lastnanme <joe@gmail.com>
The current output is
Company A <companya@companya.com>
Company B <companyb@companyb.com>
Company C <companyc@companyc.com>
firstname lastname <firstname@gmail.com>
Fake Person <companyb@companyb.com>
Joe lastnanme <joe@gmail.com>
The desired output would be
Company A <companya@companya.com>
Company B <companyb@companyb.com>
Company C <companyc@companyc.com>
firstname lastname <firstname@gmail.com>
Joe lastnanme <joe@gmail.com>
(since both lists contain companyb@companyb.com)
How can I do this?
Here's one in awk:
$ awk '
match($0,/[a-z0-9.]+@[a-z.]+/) {         # look for emailish string *
    a[substr($0,RSTART,RLENGTH)]=$0      # and hash the record using the address as key
}
END {                                    # after all are processed
    for(i in a)                          # output them in no particular order
        print a[i]
}' file2 file1                           # switch order to see how it affects output
Output:
Company A <companya@companya.com>
Company B <companyb@companyb.com>
Company C <companyc@companyc.com>
Joe lastnanme <joe@gmail.com>
firstname lastname <firstname@gmail.com>
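The script looks for a very simple email-like string (* see the regex in the script and tune it to your liking) and uses it as the key under which the whole record is hashed. The last instance wins, because earlier instances are overwritten.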
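As a variation on the script above, here is a minimal sketch that keeps the first instance instead of the last, prints in input order, and compares addresses case-insensitively (the question mentions uniq -i). The file names list1/list2, the array name seen, and the capitalised CompanyB address in the second list are assumptions made for illustration; lines with no email-like string are simply dropped.

```shell
# Sample data based on the question; one address is capitalised differently
# on purpose to exercise the case-insensitive comparison.
tmp=$(mktemp -d)
printf '%s\n' 'Company A <companya@companya.com>' \
              'Company B <companyb@companyb.com>' \
              'Company C <companyc@companyc.com>' > "$tmp/list1"
printf '%s\n' 'firstname lastname <firstname@gmail.com>' \
              'Fake Person <CompanyB@companyb.com>' \
              'Joe lastnanme <joe@gmail.com>' > "$tmp/list2"

# Key each record on the lower-cased address and print only the first record per key.
result=$(awk '
match($0, /[A-Za-z0-9.]+@[A-Za-z.]+/) {
    key = tolower(substr($0, RSTART, RLENGTH))   # case-insensitive key
    if (!(key in seen)) {                        # first instance wins
        seen[key]                                # referencing the element creates it
        print                                    # printed in input order
    }
}' "$tmp/list1" "$tmp/list2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because records are printed as they are read rather than in the END block, the output order follows the input files, and swapping list1 and list2 on the command line decides which of two duplicate records survives.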
Could you please try the following.
awk '
{
  match($0,/<.*>/)
  val=substr($0,RSTART,RLENGTH)
}
FNR==NR{
  a[val]=$0
  print
  next
}
!(val in a)
' list1 list2
Explanation: adding an explanation of the above code.
awk '                               ##Start the awk program.
{                                   ##This block runs for every line of both input files.
  match($0,/<.*>/)                  ##Use match to find everything from < up to >.
  val=substr($0,RSTART,RLENGTH)     ##Store the matched string (from RSTART, RLENGTH characters long) in val.
}                                   ##Close the block.
FNR==NR{                            ##FNR==NR is TRUE only while the first input file, list1, is being read.
  a[val]=$0                         ##Store the current line in array a under the index val.
  print $0                          ##Print the current line.
  next                              ##Skip all further statements for this line.
}
!(val in a)                         ##For list2: print the current line only if val is NOT an index of array a.
' list1 list2                       ##Input file names.
The output will be as follows.
Company A <companya@companya.com>
Company B <companyb@companyb.com>
Company C <companyc@companyc.com>
firstname lastname <firstname@gmail.com>
Joe lastnanme <joe@gmail.com>
uniq has a -f option to ignore a number of blank-delimited fields, so we can sort on the third field and then ignore the first two:
$ sort -k 3,3 infile | uniq -f 2
Company A <companya@companya.com>
Company B <companyb@companyb.com>
Company C <companyc@companyc.com>
firstname lastname <firstname@gmail.com>
Joe lastnanme <joe@gmail.com>
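However, this is not very robust: it breaks as soon as there aren't exactly two fields before the email address, because the sort will be on the wrong field and uniq will compare the wrong fields.
See karakfa's answer to see how uniq isn't even needed here.
Alternatively, just check the last field for uniqueness:
awk '!e[$NF] {print; ++e[$NF]}' infile
Or even shorter, stealing from karakfa: awk '!e[$NF]++' infile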
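The robustness caveat above can be seen on a small made-up input. In this sketch the three-word name "Big Company B" is a hypothetical record, not taken from the question, and LC_ALL=C is set only to make the sort order predictable.

```shell
# Hypothetical input: one name has three words, so the address is field 4 on that line.
tmp=$(mktemp -d)
printf '%s\n' 'Big Company B <companyb@companyb.com>' \
              'Fake Person <companyb@companyb.com>' > "$tmp/infile"

# sort/uniq approach: after skipping two fields the remainders differ,
# so uniq treats the lines as distinct and both survive.
n_uniq=$(LC_ALL=C sort -k 3,3 "$tmp/infile" | uniq -f 2 | grep -c '')

# Last-field approach: both lines end in the same field, so only the first is printed.
n_awk=$(awk '!e[$NF]++' "$tmp/infile" | grep -c '')

echo "uniq -f 2 kept $n_uniq line(s); awk kept $n_awk line(s)"
rm -rf "$tmp"
```

The last-field check does not depend on how many words precede the address, which is why it copes with this input while the fixed field count does not.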
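Given your file format,
$ awk -F'[<>]' '!a[$2]++' files
will print the first instance of each duplicated value found inside the angle brackets. Or, if there is nothing after the email address, there is no need to split on the angle brackets: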
$ awk '!a[$NF]++' files
The same can be done with sort:
$ sort -t'<' -k2,2 -u files
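with the side effect that the output will be sorted, which may or may not be what you want.
N.B. Both alternatives assume that angle brackets don't appear anywhere other than as the wrapper around the email address.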
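To illustrate when splitting on the angle brackets matters, here is a small sketch with made-up input in which some lines carry trailing text after the address. The "(work)" suffix and the file name are assumptions for illustration only.

```shell
# Hypothetical input: two of the lines have extra text after the closing bracket.
tmp=$(mktemp -d)
printf '%s\n' 'Company B <companyb@companyb.com>' \
              'Fake Person <companyb@companyb.com> (work)' \
              'Joe lastnanme <joe@gmail.com> (work)' > "$tmp/files"

# Splitting on < and > makes field 2 the bare address, whatever follows it.
by_bracket=$(awk -F'[<>]' '!a[$2]++' "$tmp/files")

# With the default separator the last field is "(work)" on two lines,
# so the duplicate address is kept and an unrelated line is dropped.
by_last=$(awk '!a[$NF]++' "$tmp/files")

printf 'split on brackets:\n%s\n\nlast field:\n%s\n' "$by_bracket" "$by_last"
rm -rf "$tmp"
```

On this input the bracket-splitting version keeps Company B and Joe, while the last-field version keeps Company B and Fake Person, so the shorter form is only safe when the address really is the last thing on the line.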
Maybe I have misunderstood the question!
But you can try this awk:
awk 'NR!=FNR && $NF in a{next}{a[$NF]}1' list1 list2