Pig - Mapping and retrieving two columns against master table?
I am experimenting with Pig on the openflights dataset (https://openflights.org/data.html). I currently have a query result containing every unique possible flight route, i.e. the table below
+---------------+-------------+
| Start_Airport | End_Airport |
+---------------+-------------+
| YYZ | NYC |
| YBG | YVR |
| AEY | GOH |
+---------------+-------------+
and I want to join both values against a master table that holds the latitude and longitude of every airport, i.e.
+---------+----------+-----------+
| Airport | Latitude | Longitude |
+---------+----------+-----------+
| YYZ | -10.3 | 1.23 |
| YBG | -40.3 | 50.4 |
| AEY | 30.3 | 30.3 |
+---------+----------+-----------+
How would I go about doing this? I essentially want a final table that looks like
+----------------+----------+-----------+-------------+----------+-----------+
| Start_Airport | Latitude | Longitude | End_Airport | Latitude | Longitude |
+----------------+----------+-----------+-------------+----------+-----------+
| YYZ | -10.3 | 1.23 | NYC | blah | blah |
| YBG | -40.3 | 50.4 | YVR | blah | blah |
| AEY | 30.3 | 30.3 | GOH | blah | blah |
+----------------+----------+-----------+-------------+----------+-----------+
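I am currently trying the following, where c is the first table
route_data = JOIN c by (start_airport, end_airport), airports_all by ($0, $0);
which I believe essentially says: join start_airport and end_airport to their respective airport codes, and then pull in their respective latitudes and longitudes.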
route_data = JOIN c by (start_airport, end_airport), airports_all by ($0, $0);
This is similar to the "and" condition in a typical SQL join query. Consider the query below. Would it produce the result you want?
select * from c a join airports_all b on a.start_airport = b.first_field and a.end_airport = b.first_field;
It produces rows only when start_airport and end_airport are the same.
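To make the point about the composite key concrete, here is a small Python sketch that mimics an inner join of (start_airport, end_airport) against ($0, $0). It only illustrates the join semantics and is not Pig itself. The function name `composite_join` and the extra ("YYZ", "YYZ") route are made up for the demonstration.

```python
# Sample rows from the question. The ("YYZ", "YYZ") route is a hypothetical
# extra row, added only to show the one case that can match.
routes = [("YYZ", "NYC"), ("YBG", "YVR"), ("AEY", "GOH"), ("YYZ", "YYZ")]
airports = [("YYZ", -10.3, 1.23), ("YBG", -40.3, 50.4), ("AEY", 30.3, 30.3)]


def composite_join(routes, airports):
    """Inner join on (start, end) == (code, code), like JOIN ... BY ($0, $0)."""
    joined = []
    for start, end in routes:
        for code, lat, lon in airports:
            # Both key columns are compared against the same airport code,
            # so a row survives only if start == end == code.
            if (start, end) == (code, code):
                joined.append((start, end, code, lat, lon))
    return joined


result = composite_join(routes, airports)
print(result)
```

Only the made-up YYZ to YYZ route survives. The three real routes from the question produce no rows, because no single airport code equals both the start and the end of the route.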
What you want can be achieved as follows:
cat > routes.txt
YYZ,NYC
YBG,YVR
AEY,GOH
cat > airports_all.txt
YYZ,-10.3,1.23
YBG,-40.3,50.4
AEY,30.3,30.3
Pig code:
tab1 = load '/home/ec2-user/routes.txt' using PigStorage(',') as (start_airport,end_airport);
describe tab1
tab2 = load '/home/ec2-user/airports_all.txt' using PigStorage(',') as (Airport,Latitude,Longitude);
describe tab2
tab3 = JOIN tab1 by (start_airport), tab2 by (Airport);
describe tab3
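-- tab3 fields by position: $0 start_airport, $1 end_airport, $2 Airport, $3 Latitude, $4 Longitude
tab4 = foreach tab3 generate $0 as start_airport, $3 as start_Latitude, $4 as start_Longitude, $1 as end_airport;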
describe tab4
tab5 = JOIN tab4 by (end_airport), tab2 by (Airport);
describe tab5
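-- tab5 fields by position: $0..$3 come from tab4, then $4 Airport, $5 Latitude, $6 Longitude
tab6 = foreach tab5 generate $0 as start_airport, $1 as start_Latitude, $2 as start_Longitude, $3 as end_airport, $5 as end_Latitude, $6 as end_Longitude;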
describe tab6
dump tab6
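The two joins above amount to looking up each route's start code and then its end code in the master table. The Python sketch below simulates that lookup on the sample files, as an illustration rather than a Pig replacement. It assumes airport codes are unique in the master table. The function `attach_coordinates` and its `keep_missing` flag are hypothetical names: `keep_missing=False` behaves like Pig's default inner JOIN, and `keep_missing=True` approximates what a LEFT OUTER join on each step would keep.

```python
# Sample data from the files above; the master table is keyed by airport code.
routes = [("YYZ", "NYC"), ("YBG", "YVR"), ("AEY", "GOH")]
airports = {"YYZ": (-10.3, 1.23), "YBG": (-40.3, 50.4), "AEY": (30.3, 30.3)}


def attach_coordinates(routes, airports, keep_missing=False):
    """Look up start and end coordinates for every route."""
    rows = []
    for start, end in routes:
        start_pos = airports.get(start)  # first join: start_airport = Airport
        end_pos = airports.get(end)      # second join: end_airport = Airport
        if not keep_missing and (start_pos is None or end_pos is None):
            continue  # an inner join drops routes with an unknown airport
        start_lat, start_lon = start_pos or (None, None)
        end_lat, end_lon = end_pos or (None, None)
        rows.append((start, start_lat, start_lon, end, end_lat, end_lon))
    return rows


inner_rows = attach_coordinates(routes, airports)
outer_rows = attach_coordinates(routes, airports, keep_missing=True)
print(inner_rows)
print(outer_rows)
```

Note that the three-row sample master table does not contain NYC, YVR, or GOH, so the inner-join variant returns no rows for this sample. With the full openflights airport list, or with an outer join, the end coordinates would be filled in or left empty respectively.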