Join based on a condition, filter by time range, and limit to just the first row in Pig
I have a relation A and a relation B. For each row in A, there may be multiple matching rows in B.
For example:
A = (id1, type, location, gender, startDateTime)
B = (id2, type, location, gender, registerStartDateTime, registerEndDateTime, value)
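I need to join A and B on (type, location, gender), with the additional conditions (startDateTime > registerStartDateTime) and (startDateTime < registerEndDateTime).
This join may return multiple rows from B with different values. I only want to pick the first returned row for the final output.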
output = Join A by (type, location, gender), B by (type, location, gender)
How do I add the date-time range condition to the join above?
How do I restrict B to a single row when performing the join?
In SQL:
SELECT
a.id, b.value
FROM
a, b
WHERE
a.type = b.type
AND a.location = b.location
AND a.gender = b.gender
AND a.startDateTime between b.registerStartDateTime and b.registerEndDateTime
limit 1;
How do I do the same thing in Pig?
Try this:
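-- Assumes A and B are already loaded with the schemas from the question:
-- A = (id1, type, location, gender, startDateTime)
-- B = (id2, type, location, gender, registerStartDateTime, registerEndDateTime, value)
-- Join on the equality keys first. The alias is "joined" because "output" is a reserved keyword in Pig.
joined = JOIN A BY (type, location, gender), B BY (type, location, gender);
-- Pig joins are equi-joins only, so the time range is applied as a filter afterwards.
filteroutput = FILTER joined BY (startDateTime > registerStartDateTime) AND (startDateTime < registerEndDateTime);
/* Optional: sort first so that the single row kept is well defined.
sortoutput = ORDER filteroutput BY startDateTime;
limitoutput = LIMIT sortoutput 1;
*/
limitoutput = LIMIT filteroutput 1;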
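Note that the final LIMIT above keeps one row in total, which matches the SQL in the question. If the goal is instead one matching B row for each row of A, a nested FOREACH over a grouping is one possible approach. The following is only a sketch under assumptions not stated in the question: the file names 'a.tsv' and 'b.tsv' are hypothetical, all fields are loaded as chararray, the date-times are ISO-8601 strings (so string comparison follows chronological order), and "first" is taken to mean the earliest registerStartDateTime.

```pig
-- Hypothetical input paths and types; adjust to the real data.
A = LOAD 'a.tsv' AS (id1:chararray, type:chararray, location:chararray,
                     gender:chararray, startDateTime:chararray);
B = LOAD 'b.tsv' AS (id2:chararray, type:chararray, location:chararray,
                     gender:chararray, registerStartDateTime:chararray,
                     registerEndDateTime:chararray, value:chararray);

-- Equi-join on the shared keys, then apply the time-range condition.
joined = JOIN A BY (type, location, gender), B BY (type, location, gender);
in_range = FILTER joined BY (A::startDateTime > B::registerStartDateTime)
                        AND (A::startDateTime < B::registerEndDateTime);

-- Keep only the fields needed downstream and give them unambiguous names.
slim = FOREACH in_range GENERATE A::id1 AS id1,
                                 B::registerStartDateTime AS registerStartDateTime,
                                 B::value AS value;

-- For each id1, sort its matches and keep the first one.
by_id = GROUP slim BY id1;
first_per_id = FOREACH by_id {
    sorted = ORDER slim BY registerStartDateTime;
    top1 = LIMIT sorted 1;
    GENERATE FLATTEN(top1.(id1, value));
};
```

Sorting inside the nested block makes the choice of row deterministic; without the ORDER step, LIMIT 1 returns an arbitrary match for each id1.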