Apache Pig Distinct and Count

I am trying to figure out the following.

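How many female users have given at least one rating of 4? I think my join and filters are correct, but I can't figure out the distinct count part. I have tried several versions of the following.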

a = load '/user/pig/movie' AS (userid:int, movieid:int, rating:int, timestamp:chararray);
b = load '/user/pig/reviewer' using PigStorage('|') AS (userid:int, age:int, gender:chararray, occupation:chararray, zip:chararray);
a1 = filter a by rating == 4;
b1 = filter b by gender == 'F';
c = join a1 by userid, b1 by userid;
d = FOREACH c GENERATE COUNT(DISTINCT(userid));
dump d;

You have to GROUP before you COUNT. Ref: COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts.

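d = FOREACH c GENERATE a1::userid AS userid;  -- the join yields a1::userid and b1::userid; keep one
e = DISTINCT d;                               -- one row per female user with at least one rating of 4
f = GROUP e ALL;                              -- GROUP ALL, because a single global count is wanted
g = FOREACH f GENERATE COUNT(e);
dump g;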
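
As a rough sketch of the GROUP ALL form described above, the distinct count can also be written with a nested DISTINCT inside the FOREACH. It assumes the joined relation c from the question; the aliases ids, grp, cnt and uniq and the field name female_raters are made up for illustration.

```pig
-- Assumes relation c (a1 joined with b1 on userid) from the question's script.
-- After the join the key appears twice, so it is referenced as a1::userid.
ids = FOREACH c GENERATE a1::userid AS userid;

-- GROUP ALL puts every row into one group, which gives a global count.
grp = GROUP ids ALL;

-- Nested block: drop duplicate userids inside the group, then count them.
cnt = FOREACH grp {
    uniq = DISTINCT ids.userid;
    GENERATE COUNT(uniq) AS female_raters;
};
DUMP cnt;
```

Grouping by userid instead of ALL would return one row per user rather than a single total, which is why GROUP ALL is used here.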