How to get a SQL-like GROUP BY using Apache Pig?
I have the following input, called movieUserTagFltr:
(260,{(260,starwars),(260,George Lucas),(260,sci-fi),(260,cult classic),(260,Science Fiction),(260,classic),(260,supernatural powers),(260,nerdy),(260,Science Fiction),(260,critically acclaimed),(260,Science Fiction),(260,action),(260,script),(260,"imaginary world),(260,space),(260,Science Fiction),(260,"space epic),(260,Syfy),(260,series),(260,classic sci-fi),(260,space adventure),(260,jedi),(260,awesome soundtrack),(260,awesome),(260,coming of age)})
(858,{(858,Katso Sanna!)})
(924,{(924,slow),(924,boring)})
(1256,{(1256,Marx Brothers)})
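It follows the schema: (movieId:int, tags:bag{(movieId:int, tag:chararray),...})
Basically, the first number is a movie id, and the bag that follows contains all the keywords associated with that movie. I would like to group those keywords so that the output looks like this: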
(260,{(1,starwars),(1,George Lucas),(1,sci-fi),(1,cult classic),(4,Science Fiction),(1,classic),(1,supernatural powers),(1,nerdy),(1,critically acclaimed),(1,action),(1,script),(1,"imaginary world),(1,space),(1,"space epic),(1,Syfy),(1,series),(1,classic sci-fi),(1,space adventure),(1,jedi),(1,awesome soundtrack),(1,awesome),(1,coming of age)})
(858,{(1,Katso Sanna!)})
(924,{(1,slow),(1,boring)})
(1256,{(1,Marx Brothers)})
Note that for the movie with ID 260, the tag Science Fiction appears 4 times. Using GROUP BY and COUNT, I managed to count the distinct keywords for each movie with the following script:
sum = FOREACH group_data {
    unique_tags = DISTINCT movieUserTagFltr.tags::tag;
    GENERATE group, COUNT(unique_tags) as tag;
};
But that only returns a global count, and I want a local one. So the logic I have in mind is:
result = iterate over each tuple of group_data {
    generate a tuple with $0, and a bag with {
        foreach distinct tag that group_data has on its variable do {
            generate a tuple like: (tag_name, count of how many times that tag appeared on $1)
        }
    }
}
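You can flatten your original input so that each movieID and tag pair becomes its own record. Then group by movieID and tag to get a count for each combination. Finally, group by movieID so that you end up with one bag of tags and counts per movie.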
Assuming you start from movieUserTagFltr with the schema you described:
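-- One record per (movieID, tag) pair
A = FOREACH movieUserTagFltr GENERATE FLATTEN(tags) AS (movieID, tag);
-- Count how often each tag occurs for each movie
B = GROUP A BY (movieID, tag);
C = FOREACH B GENERATE
        FLATTEN(group) AS (movieID, tag),
        COUNT(A) AS movie_tag_count;
-- Collect the per-tag counts back into one bag per movie
D = GROUP C BY movieID;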
Your final schema is:
D: {group: int,C: {(movieID: int,tag: chararray,movie_tag_count: long)}}
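If you want each bag to hold only (count, tag) pairs, as in the expected output in the question, one more projection over D should get you there. This is only a sketch and not part of the steps above: the relation name E and the alias tags are made up for illustration, and the order of tuples inside each bag is not guaranteed.

```pig
-- E and the alias "tags" are hypothetical names chosen for this example.
-- C.(movie_tag_count, tag) keeps just those two fields from every tuple in bag C,
-- dropping the redundant movieID inside the bag.
E = FOREACH D GENERATE group AS movieID, C.(movie_tag_count, tag) AS tags;
DESCRIBE E;
```

DESCRIBE should then report a movieID field followed by a bag of (movie_tag_count, tag) tuples, which is the shape asked for in the question.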