Hive query generating identifiers for a sequence of rows matching a condition
Suppose I have the following Hive table as input; let's call it connections:
userid | timestamp
--------|-------------
1 | 1433258019
1 | 1433258020
2 | 1433258080
2 | 1433258083
2 | 1433258088
2 | 1433258170
[...] | [...]
Using the following query:
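SELECT
    userid,
    timestamp,
    timediff,
    -- a column alias cannot be referenced in the SELECT that defines it,
    -- so timediff is computed in a subquery first
    CASE
        WHEN timediff > 60
        THEN 'new_session'
        ELSE 'same_session'
    END AS session_state
FROM (
    SELECT
        userid,
        timestamp,
        timestamp - LAG(timestamp, 1, 0) OVER w AS timediff
    FROM connections
    WINDOW w AS (PARTITION BY userid ORDER BY timestamp ASC)
) t;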
I am generating the following output:
userid | timestamp | timediff | session_state
--------|-------------|------------|---------------
1 | 1433258019 | 1433258019 | new_session
1 | 1433258020 | 1 | same_session
2 | 1433258080 | 1433258080 | new_session
2 | 1433258083 | 3 | same_session
2 | 1433258088 | 5 | same_session
2 | 1433258170 | 82 | new_session
[...] | [...] | [...] | [...]
How would I go about generating:
userid | timestamp | timediff | sessionid
--------|-------------|------------|-----------------
1 | 1433258019 | 1433258019 | user1-session-1
1 | 1433258020 | 1 | user1-session-1
2 | 1433258080 | 1433258080 | user2-session-1
2 | 1433258083 | 3 | user2-session-1
2 | 1433258088 | 5 | user2-session-1
2 | 1433258170 | 82 | user2-session-2
[...] | [...] | [...] | [...]
Is this possible using only HQL and "famous" UDFs (I would rather not use custom UDFs or reducer scripts)?
Use the following:
select concat_ws('-', name, city) from employee;
The first argument of concat_ws is the separator. name and city are column names of the employee table; make sure they are of string type.
This works:
SELECT
userid,
timestamp,
timediff,
CONCAT(
'user',
userid,
'-',
'session-',
CAST(timediff / 60 AS INT) + 1
) AS session_id
FROM (
SELECT
userid,
timestamp,
timestamp - LAG(timestamp, 1, timestamp) OVER w AS timediff
FROM connections
WINDOW w AS (
PARTITION BY userid
ORDER BY timestamp ASC
)
) a;
Output:
userid timestamp timediff session_id
1 1433258019 0.0 user1-session-1
1 1433258020 1.0 user1-session-1
2 1433258080 0.0 user2-session-1
2 1433258083 3.0 user2-session-1
2 1433258088 5.0 user2-session-1
2 1433258170 82.0 user2-session-2
3 1433258270 0.0 user3-session-1
If timediff is not needed, you can try something like this:
select userid,timestamp ,session_count+ concat('user',userid,'-','session-',cast(LAG(session_count- 1,1,0) over w1 as string)) AS session_state
--LAG(session_count-1,1,0) over w1 AS session_count_new
FROM
(select
userid,
timestamp,
timediff,
cast (timediff/60 as int)+1 as session_count
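Interesting question. Based on your comment to @Madhu, I added the row 2 1433258172 to your example. What you need is a counter that increments every time timediff > 60 is satisfied. The easiest way is to flag those rows and then take a cumulative sum over the window.
Query: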
select userid
, timestamp
, concat('user', userid, '-session-', s_sum) sessionid
from (
select *
, sum( counter ) over (partition by userid
order by timestamp asc
rows between unbounded preceding and current row) s_sum
from (
select *
, case when timediff > 60 then 1 else 0 end as counter
from (
select userid
, timestamp
, timestamp - lag(timestamp, 1, 0) over (partition by userid
order by timestamp asc) timediff
from connections ) x ) y ) z
Output:
1 1433258019 user1-session-1
1 1433258020 user1-session-1
2 1433258080 user2-session-1
2 1433258083 user2-session-1
2 1433258088 user2-session-1
2 1433258170 user2-session-2
2 1433258172 user2-session-2
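To make the flag-then-cumulative-sum idea easier to follow, here is a minimal Python sketch that traces the same logic outside Hive. It is only an illustration under stated assumptions: the function name assign_session_ids and the constant SESSION_GAP_SECONDS are hypothetical, and the sample rows are the ones from the example above, including the added 2 1433258172 row.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP_SECONDS = 60  # threshold taken from the question (timediff > 60)


def assign_session_ids(rows, gap=SESSION_GAP_SECONDS):
    """Trace of the HiveQL logic for an iterable of (userid, timestamp) rows."""
    result = []
    # Sorting the tuples plays the role of PARTITION BY userid ORDER BY timestamp.
    ordered = sorted(rows)
    for userid, group in groupby(ordered, key=itemgetter(0)):
        previous = 0       # mirrors the default of LAG(timestamp, 1, 0)
        running_total = 0  # mirrors SUM(counter) OVER (... UNBOUNDED PRECEDING ...)
        for _, ts in group:
            # Flag the row: 1 when the gap to the previous row exceeds the threshold.
            counter = 1 if ts - previous > gap else 0
            running_total += counter
            result.append((userid, ts, f"user{userid}-session-{running_total}"))
            previous = ts
    return result


connections = [
    (1, 1433258019), (1, 1433258020),
    (2, 1433258080), (2, 1433258083), (2, 1433258088),
    (2, 1433258170), (2, 1433258172),
]

for row in assign_session_ids(connections):
    print(row)
```

Because the lag default is 0, the first row of every user has a huge timediff and is flagged, which is why the numbering starts at session-1 for epoch-style timestamps.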