Sampling results with conditions in Hive SQL
I have a table with no primary key, partitioned by date, with columns like these:
1. user_id
2. device
3. region
4. datetime
5. and other columns
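It contains user-generated events from a website game, which fire every second. I want to return a batch containing all events (including duplicate rows) generated by the first 6 users of the day (top of the table) that satisfy these conditions:
Region = US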
- one user from iOS
- one user from android
- one user from PC
Region = EU
- one user from iOS
- one user from android
- one user from PC
Could you provide sample code showing where I should start? A friend of mine suggested something involving RANK(), but I have never used it.
Thanks!
SELECT * FROM
(SELECT user_id,
event_post_time,
device,
region,
COUNT(DISTINCT player_id) over (partition by player_id) as ct_pid,
COUNT(DISTINCT region) over (partition by region) as ct_region,
COUNT(DISTINCT device) over (partition by device) as ct_device
FROM events
WHERE event_post_time = current_date()
AND region IN ('EU','US')
AND device IN ('ios','android','pc')) e
WHERE ct_pid <= 6
AND ct_region <= 2
AND ct_device <= 3
ORDER BY player_id
I added dummy data at SQLFiddle, and this is the expected output:
user_id device region date_generated
1 ios EU 22-05-18
1 ios EU 22-05-18
1 ios EU 22-05-18
4 ios US 22-05-18
4 ios US 22-05-18
2 android EU 22-05-18
2 android US 22-05-18
4 pc EU 22-05-18
4 pc EU 22-05-18
4 pc EU 22-05-18
5 pc US 22-05-18
Perhaps this is what you are looking for.
select * from (
  select rank() over (partition by region, device order by cn desc) as top_num,
         player_id, region, device, cn
  from (
    select count(*) as cn, player_id, region, device
    from test_table
    group by player_id, region, device
  ) l
) t
where top_num = 1;
Let me know if it helps.
OP edit:
I managed to adapt the query you provided to fit my needs; this is the final version:
WITH combo
AS (SELECT user_id,
region,
device
FROM (SELECT Rank()
OVER (
partition BY region, device
ORDER BY cn DESC) AS top_num,
user_id,
region,
device,
cn
FROM (SELECT Count(*) AS cn,
user_id,
region,
device
FROM samples
GROUP BY user_id,
region,
device)l)t
WHERE top_num = 1)
SELECT s.user_id,
s.region,
s.device
FROM samples s
JOIN combo
ON s.user_id = combo.user_id
AND s.region = combo.region
AND s.device = combo.device
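One caveat with the query above: RANK() gives tied rows the same rank, so if two users have the same event count within a (region, device) group, both come back with top_num = 1 and the result covers more than six users. If exactly one user per group is required, a ROW_NUMBER() variant with a tie-breaker may be closer to the goal. This is a minimal sketch, assuming the same samples table and columns as above; the CTE names counts and ranked and the tie-break on user_id are my own choices, not part of the original answer.

```sql
-- Sketch only: table and column names follow the OP's final query.
WITH counts AS (
  -- Number of events per user within each (region, device) group.
  SELECT user_id, region, device, COUNT(*) AS cn
  FROM samples
  WHERE region IN ('EU', 'US')
    AND device IN ('ios', 'android', 'pc')
  GROUP BY user_id, region, device
),
combo AS (
  -- ROW_NUMBER() never repeats a number inside a partition,
  -- so exactly one user survives per (region, device).
  SELECT user_id, region, device
  FROM (
    SELECT user_id, region, device,
           ROW_NUMBER() OVER (PARTITION BY region, device
                              ORDER BY cn DESC, user_id) AS rn  -- user_id breaks ties (assumption)
    FROM counts
  ) ranked
  WHERE rn = 1
)
-- Join back to the raw table to keep every event row, duplicates included.
SELECT s.*
FROM samples s
JOIN combo c
  ON s.user_id = c.user_id
 AND s.region  = c.region
 AND s.device  = c.device;
```

Since the table is partitioned by date, adding the partition filter both in counts and in the final SELECT should keep Hive from scanning every partition.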