How do I find the first value in each 3-month group, counting back from the latest date, in Hive?

I have a table like the one below.

+-------+-------+--------------+---------------+
| Col_1 | Col_2 | Refresh_Date | Refresh_Value |
+-------+-------+--------------+---------------+
| AE    | A1    | 2019-12-01   |             1 |
| AE    | A1    | 2020-01-01   |             3 |
| AE    | A1    | 2020-02-01   |             5 |
| AE    | A1    | 2020-03-01   |             7 |
| AE    | A1    | 2020-04-01   |            12 |
| AE    | A1    | 2020-05-01   |            14 |
| AE    | A1    | 2020-06-01   |            11 |
| AE    | A1    | 2020-07-01   |            15 |
+-------+-------+--------------+---------------+

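I need to get the first Refresh_Value (ordered by Refresh_Date) in each 3-month period, counting back from the latest date. The output should have 2 additional columns (Group, Refresh_Value_Min): Group says which 3-month group each date belongs to, and Refresh_Value_Min holds the first value of that group.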

Expected output

+-------+-------+--------------+---------------+-------+-------------------+
| Col_1 | Col_2 | Refresh_Date | Refresh_Value | Group | Refresh_Value_Min |
+-------+-------+--------------+---------------+-------+-------------------+
| AE    | A1    | 2019-12-01   |             1 | Grp3  |                 1 |
| AE    | A1    | 2020-01-01   |             3 | Grp3  |                 1 |
| AE    | A1    | 2020-02-01   |             5 | Grp2  |                 5 |
| AE    | A1    | 2020-03-01   |             7 | Grp2  |                 5 |
| AE    | A1    | 2020-04-01   |            12 | Grp2  |                 5 |
| AE    | A1    | 2020-05-01   |            14 | Grp1  |                14 |
| AE    | A1    | 2020-06-01   |            11 | Grp1  |                14 |
| AE    | A1    | 2020-07-01   |            15 | Grp1  |                14 |
+-------+-------+--------------+---------------+-------+-------------------+

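I tried the code below. For the current row it returns the value from two rows (months) earlier, that is, a sliding 3-row window, but I need fixed groups like the output above.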

first_value(Refresh_Value) over (partition by col_1,col_2 order by Refresh_Date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)

Can anyone help?

Let me know if you have any questions.

Let me explain the approach (minor details may differ):

  1. Attach the latest date (per Col_1, Col_2) to every row
with data_with_last_dt as (
  select Col_1, Col_2, Refresh_Date, Refresh_Value,
         max(Refresh_Date) over (partition by Col_1, Col_2) as Last_Date
    from target_table
),
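  2. Take the difference in months from the latest date and divide by 3 (integer division, which is div in Hive, since / returns a double) to get the group number
data_with_group as (
  select Col_1, Col_2, Refresh_Date, Refresh_Value,
         cast(months_between(Last_Date, Refresh_Date) as int) div 3 as Group_Id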
    from data_with_last_dt 
)
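  3. Find the first Refresh_Value (by Refresh_Date) in each group; first_value is used here because min would return the smallest value, not the earliest one
select Col_1, Col_2, Refresh_Date, Refresh_Value, Group_Id,
       first_value(Refresh_Value) over (partition by Col_1, Col_2, Group_Id order by Refresh_Date) as Refresh_Value_Min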
  from data_with_group 
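
Put together, the three steps above form a single query. What follows is a minimal sketch, not a tested implementation. It assumes a Hive version that provides months_between and the stack() table function, and it builds the question's sample rows inline under the name target_table so that it does not depend on an existing table. The Group label expression (concat of 'Grp' and Group_Id + 1) and the backtick-quoted alias are assumptions added to match the expected output.

```sql
-- Sample rows from the question, built inline (stand-in for the real table).
with target_table as (
  select stack(8,
           'AE', 'A1', '2019-12-01', 1,
           'AE', 'A1', '2020-01-01', 3,
           'AE', 'A1', '2020-02-01', 5,
           'AE', 'A1', '2020-03-01', 7,
           'AE', 'A1', '2020-04-01', 12,
           'AE', 'A1', '2020-05-01', 14,
           'AE', 'A1', '2020-06-01', 11,
           'AE', 'A1', '2020-07-01', 15
         ) as (Col_1, Col_2, Refresh_Date, Refresh_Value)
),
-- Step 1: latest date per (Col_1, Col_2) on every row.
data_with_last_dt as (
  select Col_1, Col_2, Refresh_Date, Refresh_Value,
         max(Refresh_Date) over (partition by Col_1, Col_2) as Last_Date
    from target_table
),
-- Step 2: whole months back from the latest date, integer-divided by 3.
data_with_group as (
  select Col_1, Col_2, Refresh_Date, Refresh_Value,
         cast(months_between(Last_Date, Refresh_Date) as int) div 3 as Group_Id
    from data_with_last_dt
)
-- Step 3: earliest value in each group, plus a Grp1/Grp2/... label.
select Col_1, Col_2, Refresh_Date, Refresh_Value,
       concat('Grp', cast(Group_Id + 1 as string)) as `Group`,
       first_value(Refresh_Value) over (partition by Col_1, Col_2, Group_Id
                                        order by Refresh_Date) as Refresh_Value_Min
  from data_with_group
 order by Col_1, Col_2, Refresh_Date
```

With the sample data, the months back from 2020-07-01 are 0 to 7, so May to July land in Grp1, February to April in Grp2, and December to January in Grp3, as in the expected output. Because every date falls on the first of the month, months_between returns whole numbers; for dates on other days the cast truncates the fractional part, which may or may not be the grouping you want.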

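First find the groups (this numbers rows from the latest date, so it assumes one row per month):

tbl1:
-- source_table is a placeholder for your table name.
-- The alias is grp because group is a reserved word in Hive.
select col1, col2, refresh_date, refresh_value,
       (row_number() over (partition by col1, col2 order by refresh_date desc) - 1) div 3 as grp
  from source_table

Then find the row with the earliest date in each group. A window function cannot be used directly in a where clause, so it is computed in a subquery first:

tbl2:
select col1, col2, refresh_date, refresh_value, grp
  from (select t.*,
               min(refresh_date) over (partition by col1, col2, grp) as min_date
          from tbl1 t) s
 where refresh_date = min_date

Then join that first value back onto every row:

tbl3:
select t1.col1, t1.col2, t1.refresh_date, t1.refresh_value, t1.grp, t2.refresh_value as refresh_value_min
  from tbl1 t1
  join tbl2 t2
    on (t1.col1 = t2.col1 and t1.col2 = t2.col2 and t1.grp = t2.grp)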
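
The same row-count grouping can also be written as one statement without the self-join, by letting first_value pick the earliest row of each group. This is an alternative sketch, not the method described above: source_table is a placeholder table name, grp and rn are hypothetical aliases, and it still assumes exactly one row per month for each (col1, col2).

```sql
-- Number rows from the latest date backwards within each (col1, col2).
with numbered as (
  select col1, col2, refresh_date, refresh_value,
         row_number() over (partition by col1, col2
                            order by refresh_date desc) as rn
    from source_table
),
-- Rows 1-3 become group 0, rows 4-6 group 1, and so on.
grouped as (
  select col1, col2, refresh_date, refresh_value,
         (rn - 1) div 3 as grp
    from numbered
)
-- The earliest row in each group supplies refresh_value_min.
select col1, col2, refresh_date, refresh_value, grp,
       first_value(refresh_value) over (partition by col1, col2, grp
                                        order by refresh_date) as refresh_value_min
  from grouped
```

If months can be missing from the data, grouping by row count drifts away from calendar months, and a grouping based on the month difference from the latest date is the safer choice.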