Suggest the most optimized way using Hive or Pig
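Problem statement
Suppose there is a log text file. Below are the fields in the file.
Log file
userID
productID
action
Action is one of:
Browse, Click, AddToCart, Purchase, LogOut
Select the users who performed the AddToCart action but did not perform the Purchase action.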
('1001','101','201','Browse'),
('1002','102','202','Click'),
('1001','101','201','AddToCart'),
('1001','101','201','Purchase'),
('1002','102','202','AddToCart')
Can anyone suggest how to get this information using Hive or Pig with optimized performance?
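Hive: use not in. Note that the string comparison is case-sensitive, so the literal must be 'AddToCart', matching the data.
select * from table
where action='AddToCart' and
userID not in (select distinct userID from table where action='Purchase')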
Pig: filter the ids by action, do a left join, and check whether the id from the right side is null.
-- Note: I am assuming the first 3 columns are int. You will have to figure out the loading without the quotes.
A = LOAD '\path\file.txt' USING PigStorage(',') AS (userID:int,b:int,c:int,action:chararray);
B = FILTER A BY (action == 'AddToCart');
C = FILTER A BY (action == 'Purchase');
D = JOIN B BY userID LEFT OUTER, C BY userID;
E = FILTER D BY C::userID is null;
DUMP E;
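This can be done in a single table scan using sum() or analytic sum(), depending on the exact requirements. What if a user added two products to the cart but purchased only one?
For user + product: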
select userID, productID
from
(
select
userID,
productID,
sum(case when action='AddToCart' then 1 else 0 end) addToCart_cnt,
sum(case when action='Purchase' then 1 else 0 end) Purchase_cnt
from table
group by userID, productID
)s
where addToCart_cnt>0 and Purchase_cnt=0
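If the requirement is per user rather than per user + product, the same conditional aggregation could be grouped by userID alone. This is only a sketch: it assumes that "did not purchase" means the user has no Purchase row at all, and `table` is the same placeholder table name used in the queries above.

```sql
-- Users with at least one AddToCart row and no Purchase row, in one scan.
-- `table` is a placeholder; substitute the real table name.
select userID
from table
group by userID
having sum(case when action='AddToCart' then 1 else 0 end) > 0
   and sum(case when action='Purchase' then 1 else 0 end) = 0
```

On the sample rows in the question, user 1001 has a Purchase row and user 1002 does not, so only 1002 satisfies both conditions.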