Hive: select previous nth row with minimum value for a column
I have data like this:
ID START_DATE STATUS
10 2013-05-29 FREE
10 2013-05-29 PAID
10 2014-05-30 PAID
10 2014-11-29 FREE
10 2014-12-02 PAID
10 2015-09-29 PAID
10 2015-12-02 PAID
10 2016-04-04 PAID
10 2016-04-05 FREE
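My output should contain only the rows where STATUS = 'FREE'. For each FREE row, I need the minimum START_DATE among the PAID rows that immediately precede it: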
ID STATUS PREVIOUS_MIN_PAID_START_DATE
10 FREE NULL
10 FREE 2013-05-29
10 FREE 2014-12-02
The LAG() function only gives the previous value. How can I get the previous minimum value (the nth row back)?
SELECT
ID,
STATUS,
LAG(CASE WHEN STATUS = 'PAID' THEN START_DATE END, 1)
OVER (PARTITION BY ID ORDER BY START_DATE) AS previous_paid_start_date
FROM
TEMP
WHERE
STATUS = 'FREE'
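Not sure why you got downvotes; I think this is a very interesting (and well-described) question. Anyway, here is one way to do it, though I must admit it feels suboptimal and hacky.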
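Basically, what you need is an index that groups each FREE row with all the PAID rows leading up to it, back to (but not including) the previous FREE row (I hope I have understood this correctly). To illustrate: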
id start_date status idx
10 2013-05-29 FREE 0
10 2013-05-29 PAID 1
10 2014-05-30 PAID 1
10 2014-11-29 FREE 1
10 2014-12-02 PAID 2
10 2015-09-29 PAID 2
10 2015-12-02 PAID 2
10 2016-04-04 PAID 2
10 2016-04-05 FREE 2
From there you can get the minimum start_date where status is PAID, within each group defined by id and the newly created index.
Query:
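-- Assumption: rows are ordered by start_date, with status as a tie-breaker
-- so that FREE sorts before PAID on the same date, as in the sample data.
WITH tmp_table AS (
  SELECT *
       -- running total of the flag = group index s (the idx column above)
       , SUM(flg) OVER (PARTITION BY id ORDER BY start_date, status
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS s
  FROM (
    SELECT *
         -- flg = 1 on the row right after a FREE row, 0 otherwise
         , LAG(CASE WHEN status='FREE' THEN 1 ELSE 0 END, 1, 0)
             OVER (PARTITION BY id ORDER BY start_date, status) AS flg
    FROM database.original_table) x )
SELECT id
     , status
     , prev_date
FROM (
  SELECT t.id, t.status, t.s, b.prev_date
  FROM tmp_table t
  LEFT OUTER JOIN (
    -- earliest PAID date in each (id, s) group
    SELECT id, s, MIN(start_date) AS prev_date
    FROM tmp_table
    WHERE status='PAID'
    GROUP BY id, s ) b
  ON b.id=t.id AND b.s=t.s ) f
WHERE status='FREE'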
Output:
id status prev_date
10 FREE NULL
10 FREE 2013-05-29
10 FREE 2014-12-02
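If the self-join bothers you, the same grouping idea can probably be expressed with a windowed MIN instead. This is only a sketch under a few assumptions, not something I have run against your data: `database.original_table` is the same placeholder name as above (substitute your own database and table), the aliases `flagged`, `grouped` and `windowed` are made up for illustration, and rows on the same date are assumed to be ordered with FREE before PAID, as in your sample.

```sql
SELECT id
     , status
     , prev_date
FROM (
  SELECT id
       , status
       -- MIN ignores NULLs, so only PAID dates in the group are considered;
       -- a group with no PAID row yields NULL
       , MIN(CASE WHEN status = 'PAID' THEN start_date END)
           OVER (PARTITION BY id, s) AS prev_date
  FROM (
    SELECT id
         , start_date
         , status
         -- running total of the flag = group index
         , SUM(flg) OVER (PARTITION BY id ORDER BY start_date, status
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS s
    FROM (
      SELECT id
           , start_date
           , status
           -- flg = 1 on the row right after a FREE row, 0 otherwise
           , LAG(CASE WHEN status = 'FREE' THEN 1 ELSE 0 END, 1, 0)
               OVER (PARTITION BY id ORDER BY start_date, status) AS flg
      FROM database.original_table
    ) flagged
  ) grouped
) windowed
WHERE status = 'FREE'
```

The final filter on FREE is applied only after the window functions have seen every row, which is why it sits in the outermost query rather than next to the LAG.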