How do I find the first value in every last 3 months in Hive?
I have a table as shown below.
+-------+-------+--------------+---------------+
| Col_1 | Col_2 | Refresh_Date | Refresh_Value |
+-------+-------+--------------+---------------+
| AE | A1 | 2019-12-01 | 1 |
| AE | A1 | 2020-01-01 | 3 |
| AE | A1 | 2020-02-01 | 5 |
| AE | A1 | 2020-03-01 | 7 |
| AE | A1 | 2020-04-01 | 12 |
| AE | A1 | 2020-05-01 | 14 |
| AE | A1 | 2020-06-01 | 11 |
| AE | A1 | 2020-07-01 | 15 |
+-------+-------+--------------+---------------+
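Counting back from the latest date in 3-month blocks, I need to get the first Refresh_Value of each block (based on Refresh_Date). The result should have 2 additional columns (Group and Refresh_Value_Min): Refresh_Value_Min will hold the first value of the 3-month block, and Group will say which block the date belongs to.
Expected output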
+-------+-------+--------------+---------------+-------+-------------------+
| Col_1 | Col_2 | Refresh_Date | Refresh_Value | Group | Refresh_Value_Min |
+-------+-------+--------------+---------------+-------+-------------------+
| AE | A1 | 2019-12-01 | 1 | Grp3 | 1 |
| AE | A1 | 2020-01-01 | 3 | Grp3 | 1 |
| AE | A1 | 2020-02-01 | 5 | Grp2 | 5 |
| AE | A1 | 2020-03-01 | 7 | Grp2 | 5 |
| AE | A1 | 2020-04-01 | 12 | Grp2 | 5 |
| AE | A1 | 2020-05-01 | 14 | Grp1 | 14 |
| AE | A1 | 2020-06-01 | 11 | Grp1 | 14 |
| AE | A1 | 2020-07-01 | 15 | Grp1 | 14 |
+-------+-------+--------------+---------------+-------+-------------------+
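I tried the code below, but it gives each row the value from two rows earlier (a sliding 3-row window), whereas I need the output shown above.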
first_value(Refresh_Value) over (partition by col_1,col_2 order by Refresh_Date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
Can anybody help?
Let me know if you have any questions.
Let me explain the approach (minor details may vary):
- Get the last date for every row:
with data_with_last_dt as (
select Col_1, Col_2, Refresh_Date, Refresh_Value,
max(Refresh_Date) over (partition by Col_1, Col_2) as Last_Date
from target_table
),
- Take the difference in months and divide it by 3 (integer division); this gives the group number:
data_with_group as (
select Col_1, Col_2, Refresh_Date, Refresh_Value,
cast(floor(months_between(Last_Date, Refresh_Date) / 3) as int) as Group_Id
from data_with_last_dt
)
- Find the first Refresh_Value in each group:
select Col_1, Col_2, Refresh_Date, Refresh_Value, Group_Id,
first_value(Refresh_Value) over(partition by Col_1, Col_2, Group_Id order by Refresh_Date) as Refresh_Value_Min
from data_with_group
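Put together, the three steps can be sketched as one statement. This is a sketch under assumptions rather than a tested solution: target_table is the placeholder name used above, Refresh_Date is assumed to be a DATE or a 'yyyy-MM-dd' string, and the Group_Name column is an addition of mine to produce the Grp1/Grp2/Grp3 labels from the expected output (GROUP is a reserved word in Hive, so it is avoided as an alias). The division is wrapped in floor because `/` in Hive returns a double, and first_value ordered by Refresh_Date is used so that the earliest value of each group is returned (14 for May to July in the sample, where a minimum would give 11). months_between needs Hive 1.2.0 or later.

```sql
with data_with_last_dt as (
  -- latest Refresh_Date per (Col_1, Col_2), repeated on every row
  select Col_1, Col_2, Refresh_Date, Refresh_Value,
         max(Refresh_Date) over (partition by Col_1, Col_2) as Last_Date
  from target_table
),
data_with_group as (
  -- 0 = latest 3 months, 1 = the 3 months before that, and so on
  select Col_1, Col_2, Refresh_Date, Refresh_Value,
         cast(floor(months_between(Last_Date, Refresh_Date) / 3) as int) as Group_Id
  from data_with_last_dt
)
select Col_1, Col_2, Refresh_Date, Refresh_Value,
       -- hypothetical label column: Group_Id 0 becomes 'Grp1', 1 becomes 'Grp2', ...
       concat('Grp', cast(Group_Id + 1 as string)) as Group_Name,
       -- earliest value (by Refresh_Date) inside each 3-month group
       first_value(Refresh_Value) over (
         partition by Col_1, Col_2, Group_Id
         order by Refresh_Date
       ) as Refresh_Value_Min
from data_with_group
order by Col_1, Col_2, Refresh_Date;
```

On the sample data the month differences from 2020-07-01 are 0, 1, 2 for May to July (Group_Id 0), 3, 4, 5 for February to April (Group_Id 1), and 6, 7 for December and January (Group_Id 2), which lines up with Grp1, Grp2 and Grp3 in the expected output.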
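First find the groups (numbering rows from the latest date works when there is exactly one row per month):
tbl1:
select col1, col2, refresh_date, refresh_value,
cast(floor((row_number() over (partition by col1, col2 order by refresh_date desc) - 1) / 3) as int) as grp
from target_table
Then find the first row (earliest refresh_date) of each group. A window function cannot be used directly in the where clause, so it is computed in a subquery:
tbl2:
select col1, col2, refresh_date, refresh_value, grp
from (
select col1, col2, refresh_date, refresh_value, grp,
min(refresh_date) over (partition by col1, col2, grp) as min_refresh_date
from tbl1
) t
where refresh_date = min_refresh_date
Then join the first value back onto every row:
tbl3:
select t1.col1, t1.col2, t1.refresh_date, t1.refresh_value, t1.grp, t2.refresh_value as refresh_value_min
from tbl1 t1
join tbl2 t2
on (t1.col1 = t2.col1 and t1.col2 = t2.col2 and t1.grp = t2.grp)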
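The three steps above can also be chained as CTEs in a single statement. The following is only a sketch under assumptions: target_table is a placeholder table name, the data has exactly one row per month for each (col1, col2) pair, and the aliases rn, grp and group_name are names I introduced (grp instead of group, because GROUP is a reserved word in Hive). The row number is computed in its own subquery before the arithmetic, and the first row of each group is picked with a second row_number instead of a window function in the where clause, which Hive does not allow.

```sql
with tbl1 as (
  -- rn = 1 is the latest row; rows 1-3 form group 0, rows 4-6 group 1, ...
  select col1, col2, refresh_date, refresh_value,
         cast(floor((rn - 1) / 3) as int) as grp
  from (
    select col1, col2, refresh_date, refresh_value,
           row_number() over (partition by col1, col2 order by refresh_date desc) as rn
    from target_table
  ) numbered
),
tbl2 as (
  -- keep only the earliest row of each group
  select col1, col2, grp, refresh_value
  from (
    select col1, col2, grp, refresh_value,
           row_number() over (partition by col1, col2, grp order by refresh_date) as rn
    from tbl1
  ) ranked
  where rn = 1
)
-- attach each group's first value to every row of that group
select t1.col1, t1.col2, t1.refresh_date, t1.refresh_value,
       concat('Grp', cast(t1.grp + 1 as string)) as group_name,
       t2.refresh_value as refresh_value_min
from tbl1 t1
join tbl2 t2
  on t1.col1 = t2.col1 and t1.col2 = t2.col2 and t1.grp = t2.grp
order by col1, col2, refresh_date;
```

If months can be missing from the data, grouping by row position drifts away from calendar months, and grouping on a month difference from the latest date is the safer choice.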