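How to use AWK to solve this 2 Table Data Join?
I have 2 data tables, shown below (they are 2 tab-delimited files).
I am trying to populate the Table-2 Country column with the corresponding country from Table-1. The "join" has to be driven by the information in the Firstname field of Table-2.
Given the complexity of the data in the Table-2 Firstname column, what is the best way to do this? Would other Mac tools work better than AWK? Excel formulas, Perl, Filemaker, etc.?
TABLE1 (input):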
city_ascii     country   iso2
Mavinga        Angola    AO
Menongue       Angola    AO
Mucusso        Angola    AO
Guines         Cuba      CU
Havana         Cuba      CU
Holguin        Cuba      CU
Las Tunas      Cuba      CU
Manzanillo     Cuba      CU
Matanzas       Cuba      CU
Moron          Cuba      CU
Santa Clara    Cuba      CU
Varadero       Cuba      CU
TABLE2 (input):
Firstname
Fred, Havana
James, (Varadero, Cuba)
Jack (Cuba)
Harry Varadero, Cuba
Josh Cuba
Gary, Mavinga & Other, Angola
Jamie, (Angola)
TABLE2 (result):
Firstname                        Country
Fred, Havana                     Cuba
James, (Varadero, Cuba)          Cuba
Jack (Cuba)                      Cuba
Harry Varadero, Cuba             Cuba
Josh Cuba                        Cuba
Gary, Mavinga & Other, Angola    Angola
Jamie, (Angola)                  Angola
============
Debug info in response to Ed's questions:
awk -F'\t' '{print NF "<" $1 "><" $2 "><" $3 ">"}' Table3.txt | cat -v
1<city_ascii country iso2><><>
1<Mavinga Angola AO><><>
1<Menongue Angola AO><><>
1<Mucusso Angola AO><><>
1<Guines Cuba CU><><>
1<Havana Cuba CU><><>
1<Holguin Cuba CU><><>
1<Las Tunas Cuba CU><><>
1<Manzanillo Cuba CU><><>
1<Matanzas Cuba CU><><>
1<Moron Cuba CU><><>
1<Santa Clara Cuba CU><><>
1<Varadero Cuba CU><><>
==============
awk -F'\t' '{print NF "<" $1 "><" $2 "><" $3 ">"}' Table4.txt | cat -v
1<Firstname><><>
1<Fred, Havana><><>
1<James, (Varadero, Cuba)><><>
1<Jack (Cuba)><><>
1<Harry Varadero, Cuba><><>
1<Josh Cuba><><>
1<Gary, Mavinga & Other, Angola><><>
1<Jamie, (Angola)><><>
===============
cat -v tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
    map[$1] = $2
    map[$2] = $2
    next
}
FNR==1 {
    print
    FS=" "
    next
}
{
    orig = $0
    country = ""
    gsub(/[^[:alpha:]]/," ")
    for (i=NF; i>0; i--) {
        if ($i in map) {
            country = map[$i]
            break
        }
    }
    print orig, country
}
===============
awk -f tst.awk Table3.txt Table4.txt >output.txt
Firstname
Fred, Havana
James, (Varadero, Cuba)
Jack (Cuba)
Harry Varadero, Cuba
Josh Cuba
Gary, Mavinga & Other, Angola
Jamie, (Angola)
================
awk -F'\t' '{print NF "<" $1 "><" $2 "><" $3 ">"}' output.txt | cat -v
1<Firstname><><>
2<Fred, Havana><><>
2<James, (Varadero, Cuba)><><>
2<Jack (Cuba)><><>
2<Harry Varadero, Cuba><><>
2<Josh Cuba><><>
2<Gary, Mavinga & Other, Angola><><>
2<Jamie, (Angola)><><>
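For reference, a minimal sketch of how this field-count check behaves. With -F'\t', a line that contains no tab character is reported as NF=1 with the whole line in $1, which is the pattern the Table3.txt output above shows. The two temporary files and their one-line contents are made up for illustration.

```shell
# Hypothetical sample files: one really tab-delimited, one space-delimited.
tmp=$(mktemp -d)
printf 'Havana\tCuba\tCU\n' > "$tmp/tabs.txt"
printf 'Havana Cuba CU\n'   > "$tmp/spaces.txt"

# Same check as in the debug commands above.
tab_result=$(awk -F'\t' '{print NF "<" $1 "><" $2 "><" $3 ">"}' "$tmp/tabs.txt")
space_result=$(awk -F'\t' '{print NF "<" $1 "><" $2 "><" $3 ">"}' "$tmp/spaces.txt")

echo "$tab_result"     # 3<Havana><Cuba><CU>
echo "$space_result"   # 1<Havana Cuba CU><><>
rm -rf "$tmp"
```

If the real Table3.txt prints like the second case, the columns are probably separated by spaces rather than tabs, and a lookup keyed on $1 would then hold whole lines instead of city names.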
use DBI qw();
require DBD::CSV;
use List::Util 1.45 qw(uniq);
chdir '/tmp'; # location of csv files
my $dbh = DBI->connect("dbi:CSV:", undef, undef, {
    f_ext => '.csv',
    csv_sep_char => "\t",
    RaiseError => 1,
}) or die "Cannot connect: $DBI::errstr";
for my $country (
    uniq map { $_->[0] }
        # sql distinct not implemented
        $dbh->selectall_array('select country from table1')
) {
    $dbh->do(
        'update table2 set Country = ? where Firstname like ' .
            $dbh->quote("%$country%"),
        {},
        $country
    );
}
If I understand what you are doing, it uses the first column (city) and second column (country) of this \t-delimited file:
city_ascii     country   iso2
Mavinga        Angola    AO
Menongue       Angola    AO
Mucusso        Angola    AO
Guines         Cuba      CU
Havana         Cuba      CU
Holguin        Cuba      CU
Las Tunas      Cuba      CU
Manzanillo     Cuba      CU
Matanzas       Cuba      CU
Moron          Cuba      CU
Santa Clara    Cuba      CU
Varadero       Cuba      CU
and matches the strings from that file against this single-column file:
Firstname
Fred, Havana, Cuba
James, (Varadero, Cuba)
Jack (Cuba)
Harry Varadero, Cuba
Josh Cuba
Gary, Mavinga & Other, Angola
Jamie, (Angola)
to produce the two-column file in your example.
This awk does that:
awk -F '\t' 'FNR==NR{city[$1]=$2; ct[$2]; next}
#             ^^ FNR==NR means it is the first file; set city and country
FNR==1 {printf "%s\t%s\n", $0,"Country"; next}
#             ^^ second file, first line - print the header
{split($0, arr, /[^[:alpha:]]/)
# ^ split word like things from paren, punctuation, etc
for (e in arr) {s=arr[e]   # loop over those words
     if (s in city) { printf "%s\t%s\n", $0,city[s]; next }
# ^ a city? print that
     if (s in ct) { printf "%s\t%s\n", $0,s; next }}
# ^ a country? print that
}' file1 file2
Firstname                        Country
Fred, Havana                     Cuba
James, (Varadero, Cuba)          Cuba
Jack (Cuba)                      Cuba
Harry Varadero, Cuba             Cuba
Josh Cuba                        Cuba
Gary, Mavinga & Other, Angola    Angola
Jamie, (Angola)                  Angola
The next statement tells awk to stop processing the current line and move on to the next line of input.
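As a rough illustration of how next drives the two-file pattern, here is a minimal sketch. While the first file is being read, NR equals FNR, so the first rule stores the lookup and next skips the remaining rules; every later line falls through to the second rule. The file names, the sample rows and the "unknown" fallback are assumptions made for this example only.

```shell
# Hypothetical sample files for the sketch.
tmp=$(mktemp -d)
printf 'Havana\tCuba\nMavinga\tAngola\n' > "$tmp/lookup.txt"
printf 'Havana\nMavinga\nLisbon\n'       > "$tmp/names.txt"

joined=$(awk -F'\t' '
    NR==FNR { country[$1] = $2; next }   # first file only: build the lookup
    # second file only: print the name plus its country, if one is known
    { print $1 "\t" (($1 in country) ? country[$1] : "unknown") }
' "$tmp/lookup.txt" "$tmp/names.txt")

printf '%s\n' "$joined"
rm -rf "$tmp"
```

Without the next, the lines of the first file would also reach the second rule and be printed as if they belonged to the second file.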
It sounds like this might be what you are looking for:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
    map[$1] = $2
    map[$2] = $2
    next
}
FNR==1 {
    print
    FS=" "
    next
}
{
    orig = $0
    country = ""
    gsub(/[^[:alpha:]]/," ")
    for (i=NF; i>0; i--) {
        if ($i in map) {
            country = map[$i]
            break
        }
    }
    print orig, country
}
$ awk -f tst.awk file1 file2
Firstname                        Country
Fred, Havana                     Cuba
James, (Varadero, Cuba)          Cuba
Jack (Cuba)                      Cuba
Harry Varadero, Cuba             Cuba
Josh Cuba                        Cuba
Gary, Mavinga & Other, Angola    Angola
Jamie, (Angola)                  Angola
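To see what the gsub() step in that script does to a single line, here is a small sketch. Changing $0 makes awk split the record into fields again, so once punctuation is replaced by blanks each word becomes its own field, and the loop can walk them from right to left. The sketch uses the default whitespace field separator and [^A-Za-z] as an ASCII-only stand-in for [^[:alpha:]]; the out variable is just for display.

```shell
tokens=$(printf '%s\n' 'James, (Varadero, Cuba)' | awk '
{
    gsub(/[^A-Za-z]/, " ")      # replace every non-letter in $0 with a blank
    out = NF ":"                # NF reflects the re-split record
    for (i = NF; i > 0; i--)    # walk the words right to left, as tst.awk does
        out = out " " $i
    print out
}')

printf '%s\n' "$tokens"   # 3: Cuba Varadero James
```

Scanning from the right means the last recognizable word on the line is tried first. Note that a city name made of two words, such as Las Tunas, is split into two fields here and so would not match a lookup keyed on the full name.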