Case and accent insensitive matching of words in two files
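I have two unquoted, single-column TSV files (exported from a database) that contain the names of a few thousand people, and I need to find the names that appear in both files. Both files are encoded in UTF-8, use CRLF line terminators, and start with the BOM 0xEF 0xBB 0xBF.
A simple join or comm command would do the job, but the names differ in several ways: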
# cat file1.tsv
A. Einstein
Louis Pasteur
Diego Armando Maradona
Isaac Newton
Frava D’onä
D Rüge
Françoise Barré-Sinoussi
# cat file2.tsv
Diego Maradona
Albert Einstein
Francoise, BARRE SINOUSSI
Louis Pasteur
frava d'ona
Marie-Louise Von FRANZ
Dimitri Rüge
The expected matches from file2.tsv are:
Diego Maradona
Albert Einstein
Francoise, BARRE SINOUSSI
Louis Pasteur
frava d'ona
Dimitri Rüge
I wrote this bash/sed/awk/grep script, which dynamically generates regular expressions that match the last names:
#!/bin/bash

# Combining marks, given as the decimal value of their two UTF-8 bytes:
# U+0300 = 0xCC80 = 52352
# U+033F = 0xCCBF = 52415
# U+0340 = 0xCD80 = 52608
# U+036E = 0xCDAE = 52654
_COMBINING_CHARS_=()
for i in {52352..52415} {52608..52654}
do
    hex=$(printf %04X "$i")
    _COMBINING_CHARS_+=( "$(printf '\x'"${hex:0:2}"'\x'"${hex:2:2}")" )
done
_COMBINING_CHARS_ERE_=$(IFS='|'; printf %s "${_COMBINING_CHARS_[*]}")

# Function that removes the BOM, CRLF, and COMBINING characters:
sanitize() {
    LANG=C sed -E \
        -e $'1s/^\xEF\xBB\xBF//' \
        -e $'s/\r$//' \
        -e "s/$_COMBINING_CHARS_ERE_//g" \
        -- "$@"
}

# Function that generates a regex for the _lastname_:
toERE() {
    awk '
        {
            # Keep only the last name: the text after the last comma,
            # or else the last whitespace-separated field.
            if ( $0 ~ /,/ ) {
                n = split($0, a, ",");
                $0 = a[n];
            } else {
                $0 = $NF
            }
            sub("^[[:space:]]+","");
            sub("[[:space:]]+$","");
            gsub("[[:space:]-]+"," ");
        }
        {
            ere = ""
            sep = "";
            for ( nf = 1; nf <= NF; nf++ ) {
                # The separator goes between words, not in front of them.
                ere = ere sep
                n = split($nf, c, "");
                for ( i = 1; i <= n; i++ ) {
                    ere = ere "[[=" c[i] "=]]"
                }
                sep = "[[:space:]-]+"
            }
            print ere "[[:space:]]*$"
        }
    ' < <(sanitize "$@")
}

grep -E -f <(toERE "$1") <(sanitize "$2")
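Unfortunately, with the given input the result is:
grep: illegal byte sequence
The UTF-8 multi-byte characters seem to be the problem, but I can't figure out how to handle them with awk.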
What about agrep? From man agrep: "agrep - search a file for a string or regular expression, with approximate matching capabilities". It is not perfect, as we will see:
$ while IFS= read -r line
do
    echo -n "$line: "
    agrep -B -y "$line" file1
done < file2
Output:
Diego A. Maradona: agrep: 1 word matches within 6 errors
Maradona, Diego Armando
Albert Einstein: agrep: 1 word matches within 5 errors
A. Einstein
Louis Pasteur: Louis Pasteur
frava dona: agrep: 2 words match within 4 errors
Maradona, Diego Armando
Fräva Dona
These are good examples, because the problems already show up in the last three lines.
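I suggest the following trick:
cat file1.csv file2.csv | sort | uniq -d
Explanation:
cat file1.csv file2.csv: concatenates both files, one after the other
sort: puts identical lines next to each other
uniq -d: prints only the lines that occur more than once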
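A minimal sketch of that pipeline on made-up sample data; the function name common_lines and the sample names are hypothetical. One addition that is not in the answer: each file is run through sort -u first, so that a name repeated inside a single file is not reported as common. Only byte-identical lines are reported, so variants such as "A. Einstein" and "Albert Einstein" are not matched.

```shell
#!/bin/sh
# common_lines FILE1 FILE2: print the lines that occur in both files.
# Hypothetical helper; each file is de-duplicated first so that a line
# repeated within one file is not mistaken for a common line.
common_lines() {
    {
        LC_ALL=C sort -u -- "$1"
        LC_ALL=C sort -u -- "$2"
    } | LC_ALL=C sort | LC_ALL=C uniq -d
}

# Demo on made-up data.
tmpdir=$(mktemp -d) || exit 1
printf '%s\n' 'Louis Pasteur' 'Isaac Newton' 'A. Einstein' > "$tmpdir/a.txt"
printf '%s\n' 'Albert Einstein' 'Louis Pasteur' > "$tmpdir/b.txt"
common_lines "$tmpdir/a.txt" "$tmpdir/b.txt"   # prints: Louis Pasteur
rm -rf "$tmpdir"
```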
In the end I got the job done with ruby, but I'm posting an awk solution.
There were two problems:
POSIX [= =] equivalence classes do not work for diacritics
awk is not multi-byte aware
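To make the second point concrete, here is a small sketch (the sample character is chosen for illustration) showing that "ü" takes two bytes in UTF-8, and that a byte-oriented split keeps only half of it, which is not valid UTF-8 on its own. This is a plausible explanation for the "illegal byte sequence" error, not a confirmed diagnosis. The helper name hex_bytes is hypothetical.

```shell
#!/bin/sh
# "ü" (U+00FC) written as its two UTF-8 bytes, 0xC3 0xBC, using octal
# escapes so the example does not depend on the encoding of this script.
u=$(printf '\303\274')

# hex_bytes: print stdin as one run of lowercase hex byte values.
hex_bytes() {
    od -An -tx1 | tr -d ' \t\n'
}

# One character, but two bytes.
printf '%s' "$u" | hex_bytes; echo                      # c3bc

# A byte-wise split keeps only 0xC3, an incomplete UTF-8 sequence
# (the trailing 0a is the newline that cut appends).
printf '%s' "$u" | LC_ALL=C cut -b 1 | hex_bytes; echo  # c30a
```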
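Both can be worked around by converting the input files to ASCII. iconv can transliterate Roman (Latin-script) characters fairly accurately, which is exactly what I need: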
#!/bin/bash

to_ascii() {
    case $(uname) in
        Darwin)
            iconv -f UTF-8 -t UTF-8-MAC "$@" |
            iconv -f UTF-8 -t ASCII//TRANSLIT//IGNORE
            ;;
        Linux)
            iconv -f UTF-8 -t ASCII//TRANSLIT//IGNORE "$@"
            ;;
    esac |
    sed $'s/\r$//'
}
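Since the function above already needs a different branch per platform, it can be worth checking that nothing but ASCII is left before handing the data to awk. A minimal sketch of such a check; the helper name non_ascii_byte_count is hypothetical and not part of the answer.

```shell
#!/bin/sh
# non_ascii_byte_count: read stdin and print how many bytes fall outside
# the 7-bit ASCII range (0 means the input is pure ASCII).
# Hypothetical helper; the $(( )) wrapper strips the padding that some
# wc implementations add.
non_ascii_byte_count() {
    echo $(( $(LC_ALL=C tr -d '\000-\177' | wc -c) ))
}

printf 'D R\303\274ge\n' | non_ascii_byte_count   # 2 (the two bytes of "ü")
printf 'D Ruge\n' | non_ascii_byte_count          # 0
```

Assuming to_ascii is defined as above, running `to_ascii file1.tsv | non_ascii_byte_count` should print 0 when the transliteration covered every character.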
Now we only need a little normalization to find exact matches on the last name:
awk '
    {
        # Normalize the whole line.
        gsub("-+","-");
        gsub("+","");
        gsub("[.[:space:]]+"," ");
        sub("^[[:space:]]+","");
        sub("[[:space:]]+$","");
    }
    {
        # Extract the last name: the text after the last comma,
        # or else the last whitespace-separated field.
        if ($0 ~ /,/) {
            n = split($0,a,"[[:space:]]*,[[:space:]]*");
            lastname = a[n];
        } else {
            lastname = $NF;
        }
        gsub("[-[:space:]]+"," ",lastname);
        lastname = tolower(lastname);
    }
    # First input: remember each line under its last-name key.
    FNR == NR {
        keys[lastname] = $0;
        next;
    }
    # Second input: print the line followed by the matching lines of the first input.
    {
        count = 0;
        for (n in keys) {
            if (n == lastname) {
                matches[++count] = keys[n]
            }
        }
        if (count > 0) {
            print $0
            for (i = 1; i <= count; i++) {
                print "\t" matches[i]
            }
        }
    }
' <(to_ascii "$1") <(to_ascii "$2")
Output:
A Einstein
	Albert Einstein
Louis Pasteur
	Louis Pasteur
Diego Armando Maradona
	Diego Maradona
Frava D'ona
	frava d'ona
D Ruge
	Dimitri Ruge
Francoise Barre-Sinoussi
	Francoise, BARRE SINOUSSI
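To see why these pairs match, it can help to print the normalized last-name key on its own. The sketch below is a simplified version of the normalization above, for input that is already ASCII; the function name lastname_key is hypothetical, and the regexes use plain spaces and tabs instead of [[:space:]].

```shell
#!/bin/sh
# lastname_key: read ASCII names on stdin, print "KEY<TAB>ORIGINAL" per line.
# Hypothetical helper, a simplified version of the normalization above.
lastname_key() {
    awk '
    {
        original = $0
        gsub(/[. \t]+/, " ")             # dots and runs of blanks -> one space
        sub(/^ +/, ""); sub(/ +$/, "")   # trim both ends
        if ($0 ~ /,/) {
            # "First, LAST NAME": the key is the text after the last comma
            n = split($0, parts, / *, */)
            key = parts[n]
        } else {
            # "First Middle Last": the key is the last field
            key = $NF
        }
        gsub(/[- ]+/, " ", key)          # "Barre-Sinoussi" -> "Barre Sinoussi"
        print tolower(key) "\t" original
    }'
}

printf '%s\n' 'A. Einstein' 'Francoise Barre-Sinoussi' 'Francoise, BARRE SINOUSSI' |
    lastname_key
```

The first line gets the key einstein, and the last two lines both get the key barre sinoussi, which is why they end up paired.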