How to use AWK to solve this 2-table data join?

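I have 2 data tables, shown below (they are 2 tab-delimited files). I am trying to populate the Country column of Table-2 with the corresponding country from Table-1. The "join" has to be made on information inside the Firstname field of Table-2.

Given the complexity of the data in the Firstname column of Table-2, what is the best approach? Would other Mac tools be easier to use than AWK? Excel formulas, Perl, FileMaker, etc.?

TABLE1 (input):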

city_ascii  country iso2
Mavinga Angola  AO
Menongue    Angola  AO
Mucusso Angola  AO
Guines  Cuba    CU
Havana  Cuba    CU
Holguin Cuba    CU
Las Tunas   Cuba    CU
Manzanillo  Cuba    CU
Matanzas    Cuba    CU
Moron   Cuba    CU
Santa Clara Cuba    CU
Varadero    Cuba    CU

TABLE2 (input):

Firstname
Fred, Havana
James, (Varadero, Cuba)
Jack (Cuba)
Harry Varadero, Cuba
Josh Cuba
Gary, Mavinga & Other, Angola
Jamie, (Angola)

TABLE2 (result):

Firstname   Country
Fred, Havana  Cuba
James, (Varadero, Cuba) Cuba
Jack (Cuba) Cuba
Harry Varadero, Cuba    Cuba
Josh Cuba   Cuba
Gary, Mavinga & Other, Angola   Angola
Jamie, (Angola) Angola

============ Below is the debugging information in response to Ed's question:

awk -F'\t' '{print NF"<"$1"><"$2"><"$3">"}' Table3.txt | cat -v

    1<city_ascii  country iso2><><>
    1<Mavinga Angola  AO><><>
    1<Menongue    Angola  AO><><>
    1<Mucusso Angola  AO><><>
    1<Guines  Cuba    CU><><>
    1<Havana  Cuba    CU><><>
    1<Holguin Cuba    CU><><>
    1<Las Tunas   Cuba    CU><><>
    1<Manzanillo  Cuba    CU><><>
    1<Matanzas    Cuba    CU><><>
    1<Moron   Cuba    CU><><>
    1<Santa Clara Cuba    CU><><>
    1<Varadero    Cuba    CU><><>

    ==============
    awk -F'\t' '{print NF"<"$1"><"$2"><"$3">"}' Table4.txt | cat -v

    1<Firstname><><>
    1<Fred, Havana><><>
    1<James, (Varadero, Cuba)><><>
    1<Jack (Cuba)><><>
    1<Harry Varadero, Cuba><><>
    1<Josh Cuba><><>
    1<Gary, Mavinga & Other, Angola><><>
    1<Jamie, (Angola)><><>

    ===============
    cat -v tst.awk

    BEGIN { FS=OFS="\t" }
    NR==FNR {
        map[$1] = $2
        map[$2] = $2
        next
    }
    FNR==1 {
        print
        FS=" "
        next
    }
    {
        orig = $0
        country = ""
        gsub(/[^[:alpha:]]/," ")
        for (i=NF; i>0; i--) {
            if ($i in map) {
                country = map[$i]
                break
            }
        }
        print orig, country
    }

    ===============
    awk -f tst.awk Table3.txt Table4.txt >output.txt

    Firstname
    Fred, Havana    
    James, (Varadero, Cuba) 
    Jack (Cuba) 
    Harry Varadero, Cuba    
    Josh Cuba   
    Gary, Mavinga & Other, Angola   
    Jamie, (Angola) 

    ================
    awk -F'\t' '{print NF"<"$1"><"$2"><"$3">"}' output.txt | cat -v

    1<Firstname><><>
    2<Fred, Havana><><>
    2<James, (Varadero, Cuba)><><>
    2<Jack (Cuba)><><>
    2<Harry Varadero, Cuba><><>
    2<Josh Cuba><><>
    2<Gary, Mavinga & Other, Angola><><>
    2<Jamie, (Angola)><><>
use DBI qw();
require DBD::CSV;
use List::Util 1.45 qw(uniq);

chdir '/tmp'; # location of csv files
my $dbh = DBI->connect("dbi:CSV:", undef, undef, {
    f_ext => '.csv',
    csv_sep_char => "\t",
    RaiseError => 1,
}) or die "Cannot connect: $DBI::errstr";

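# For each distinct country name in table1, fill in the Country column of
# every table2 row whose Firstname contains that name. Rows that mention
# only a city (no country name) are left unchanged.
for my $country (
    uniq map { $_->[0] }
    # sql distinct not implemented
    $dbh->selectall_array('select country from table1')
) {
    $dbh->do(
        'update table2 set Country = ? where Firstname like ' .
            $dbh->quote("%$country%"),
        {},
        $country
    );
}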

If I understand what you are doing, you are using the first column (city) and the second column (country) of this \t delimited file:

city_ascii  country iso2
Mavinga Angola  AO
Menongue    Angola  AO
Mucusso Angola  AO
Guines  Cuba    CU
Havana  Cuba    CU
Holguin Cuba    CU
Las Tunas   Cuba    CU
Manzanillo  Cuba    CU
Matanzas    Cuba    CU
Moron   Cuba    CU
Santa   Clara   Cuba    CU
Varadero    Cuba    CU

and matching the strings from that file against this single-column file:

Firstname
Fred, Havana, Cuba
James, (Varadero, Cuba)
Jack (Cuba)
Harry Varadero, Cuba
Josh Cuba
Gary, Mavinga & Other, Angola
Jamie, (Angola)

to produce the two-column file in your example.

This awk does that:

awk -F '\t' 'FNR==NR{city[$1]=$2; ct[$2]; next}
             # ^^ FNR==NR means it is the first file; set city and country
     FNR==1 {printf "%s\t%s\n", $0,"Country"; next}
     # ^^   second file, first line - print the header
     {split($0, arr, /[^[:alpha:]]/)
      # ^ split word like things from paren, punctuation, etc
      for (e in arr) {s=arr[e]   # loop over those words
                      if (s in city) { printf "%s\t%s\n", $0,city[s]; next }
                      # ^ a city? print that
                      if (s in ct) { printf "%s\t%s\n", $0,s; next }}
                      # ^ a country? print that
                      }' file1 file2
Firstname   Country
Fred, Havana    Cuba
James, (Varadero, Cuba) Cuba
Jack (Cuba) Cuba
Harry Varadero, Cuba    Cuba
Josh Cuba   Cuba
Gary, Mavinga & Other, Angola   Angola
Jamie, (Angola) Angola
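
To make the split step more concrete, here is a minimal sketch of what splitting on non-letters yields for one of the sample rows. The variable name split_demo is only an illustration, and the loop walks the pieces in index order and skips the empty strings that appear between adjacent separators, which is a small departure from the for (e in arr) loop above.

```shell
# Split one sample row on every non-letter character, as the answer's script does.
split_demo=$(awk 'BEGIN {
    n = split("James, (Varadero, Cuba)", arr, /[^[:alpha:]]/)
    for (i = 1; i <= n; i++)                      # index order, left to right
        if (arr[i] != "") printf "%s\n", arr[i]   # skip empty pieces between separators
}')
printf '%s\n' "$split_demo"
```

This should print James, Varadero and Cuba on separate lines, and each of those words is then looked up in the city and country arrays.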

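The next statement tells awk to stop processing the current line and move on to the next line of input.

It sounds like this might be what you're looking for: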
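
As a small illustration of that, the sketch below uses two throwaway one-line files (hypothetical, created only for the demo) to show that a rule ending in next keeps later rules from ever seeing the first file's lines.

```shell
# Hypothetical sample files, created only for this demonstration.
tmp=$(mktemp -d)
printf 'Havana\tCuba\tCU\n' > "$tmp/file1"
printf 'Fred, Havana\n' > "$tmp/file2"

next_demo=$(awk -F '\t' '
    FNR==NR { seen[$1]; next }     # first file: remember the city, skip the remaining rules
    { print "second file: " $0 }   # reached only for lines of the second file
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$next_demo"
rm -rf "$tmp"
```

Only the line from the second file is printed, because next ended processing of the first file's line before the second rule could run.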

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
    map[$1] = $2
    map[$2] = $2
    next
}
FNR==1 {
    print
    FS=" "
    next
}
{
    orig = $0
    country = ""
    gsub(/[^[:alpha:]]/," ")
    for (i=NF; i>0; i--) {
        if ($i in map) {
            country = map[$i]
            break
        }
    }
    print orig, country
}

$ awk -f tst.awk file1 file2
Firstname       Country
Fred, Havana    Cuba
James, (Varadero, Cuba) Cuba
Jack (Cuba)     Cuba
Harry Varadero, Cuba    Cuba
Josh Cuba       Cuba
Gary, Mavinga & Other, Angola   Angola
Jamie, (Angola) Angola
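
For anyone wondering why the loop runs from NF down to 1, here is a rough sketch of what the gsub step leaves behind for one sample row. It assumes the default whitespace field splitting, which is what FS=" " selects in the script; the variable name fields_demo is only illustrative.

```shell
# Replace every non-letter with a space, then list the resulting fields right to left.
fields_demo=$(printf 'Gary, Mavinga & Other, Angola\n' | awk '
{
    gsub(/[^[:alpha:]]/, " ")                        # modifying $0 makes awk re-split it into fields
    for (i = NF; i > 0; i--) printf "%d=%s\n", i, $i # same right-to-left order as the script
}')
printf '%s\n' "$fields_demo"
```

The first field visited is Angola, so the last recognizable word on the line decides the country. Note that the fields are single words, so a multi-word city such as Las Tunas would not be found in map by this approach.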