Join based on condition and filter by timerange & limit to just the first row in Pig

I have a relation A and a relation B. For each row in A, there may be multiple matching rows in B.

For example:

A = (id1, type, location, gender, startDateTime)
B = (id2, type, location, gender, registerStartDateTime, registerEndDateTime, value)

I need to join A and B on (type, location, gender), with the additional time conditions (startDateTime > registerStartDateTime) and (startDateTime < registerEndDateTime).

This join may return multiple rows from B with different values. I want to keep only the first returned row in the final output.

output = Join A by (type, location, gender), B by (type, location, gender)

How do I add the date-time range condition to the join above? And how do I limit B to a single row when performing the join?

In SQL:

SELECT 
a.id, b.value
FROM
    a, b
WHERE
    a.type = b.type
        AND a.location = b.location
        AND a.gender = b.gender
        AND a.startDateTime between b.registerStartDateTime and b.registerEndDateTime 
limit 1;

How do I do the same thing in Pig?

Try this:

A = (id1, type, location, gender, startDateTime)
B = (id2, type, location, gender, registerStartDateTime, registerEndDateTime, value)

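joined = JOIN A BY (type, location, gender), B BY (type, location, gender);

-- Keep only rows whose startDateTime falls strictly inside the registration window.
filtered = FILTER joined BY (A::startDateTime > B::registerStartDateTime) AND (A::startDateTime < B::registerEndDateTime);

/* Optional: sort first so that the "first" row is deterministic.
sorted = ORDER filtered BY A::startDateTime;
limited = LIMIT sorted 1;
*/

-- Without a preceding ORDER, LIMIT returns an arbitrary row.
limited = LIMIT filtered 1;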
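
Note that a LIMIT on the whole relation keeps one row in total, which matches the SQL in the question. If what you actually want is one matching B row for each row of A, a nested FOREACH is one way to get it. The following is only a sketch under several assumptions that are not in the question: the file paths 'a.tsv' and 'b.tsv' are hypothetical, the date-times are stored as ISO-8601 strings (so string comparison matches chronological order), id1 uniquely identifies a row of A, and "first" is taken to mean the earliest registerStartDateTime.

```pig
-- Hypothetical inputs: tab-delimited files, date-times as ISO-8601 chararrays.
A = LOAD 'a.tsv' AS (id1:chararray, type:chararray, location:chararray,
                     gender:chararray, startDateTime:chararray);
B = LOAD 'b.tsv' AS (id2:chararray, type:chararray, location:chararray,
                     gender:chararray, registerStartDateTime:chararray,
                     registerEndDateTime:chararray, value:chararray);

-- Equi-join on the three key columns, then apply the time-range condition.
joined   = JOIN A BY (type, location, gender), B BY (type, location, gender);
filtered = FILTER joined BY (A::startDateTime > B::registerStartDateTime)
                        AND (A::startDateTime < B::registerEndDateTime);

-- Project only the needed columns and drop the A::/B:: prefixes.
matched = FOREACH filtered GENERATE A::id1 AS id1,
                                    B::registerStartDateTime AS registerStartDateTime,
                                    B::value AS value;

-- One group per A row (assumes id1 is unique in A), then keep one row per group.
grouped = GROUP matched BY id1;
first_per_a = FOREACH grouped {
    ordered = ORDER matched BY registerStartDateTime;  -- assumed meaning of "first"
    top1    = LIMIT ordered 1;
    GENERATE FLATTEN(top1);
};

DUMP first_per_a;
```

Each output tuple has the form (id1, registerStartDateTime, value). Rows of A with no B row in range do not appear at all, because JOIN is an inner join by default.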