Hive SQL query to fill missing date values in a table with the nearest values

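I have spent several days trying to work out how to add rows for missing dates in Hive, filled with the nearest value (in the example below, the most recent earlier balance), without success. Because of environment constraints, I have to use Hive SQL for this. The raw table currently looks like this: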

account name,available balance,Date of balance 

Peter,50000,2021-05-24
Peter,50035,2021-05-25
Peter,50035,2021-05-26
Peter,50610,2021-05-28
Peter,51710,2021-06-01
Peter,53028.1,2021-06-02
Peter,53916.1,2021-06-03
Mary,50000,2021-05-24
Mary,50035,2021-05-25
Mary,53028.1,2021-05-30

Raw balance table

What I need is to convert the table above into the table below:

account name,available balance,Date of balance 

Peter,50000,2021-05-24
Peter,50035,2021-05-25
Peter,50035,2021-05-26
Peter,50035,2021-05-27
Peter,50610,2021-05-28
Peter,50610,2021-05-29
Peter,50610,2021-05-30
Mary,50000,2021-05-24
Mary,50035,2021-05-25
Mary,50035,2021-05-26
Mary,50035,2021-05-27
Mary,50035,2021-05-28
Mary,50035,2021-05-29
Mary,53028.1,2021-05-30

Converted table

Can anyone share the Hive SQL logic to make this change?

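Use the lead() function to get the next date for each account, calculate the difference in days, build a string of spaces whose length is that difference minus one, split it, use posexplode to generate one row per element, and add each element's position to the date to get the missing dates: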

with mytable as (--Demo dataset, use your table instead of this
select stack(10, --number of tuples
'Peter',float(50000),'2021-05-24',
'Peter',float(50035),'2021-05-25',
'Peter',float(50035),'2021-05-26',
'Peter',float(50610),'2021-05-28',
'Peter',float(51710),'2021-06-01',
'Peter',float(53028.1),'2021-06-02',
'Peter',float(53916.1),'2021-06-03',
'Mary',float(50000),'2021-05-24',
'Mary',float(50035),'2021-05-25',
'Mary',float(53028.1),'2021-05-30'
) as (account_name,available_balance,Date_of_balance)
) --use your table instead of this CTE

select account_name, available_balance, date_add(Date_of_balance, e.i) as Date_of_balance --original date + 0-based position
from
( --Get next_date to generate date range; for the last row of an account, lead() defaults to the row's own date
select account_name, available_balance, Date_of_balance,
       lead(Date_of_balance, 1, Date_of_balance) over (partition by account_name order by Date_of_balance) next_date
  from mytable d  --use your table
) s lateral view outer posexplode(split(space(datediff(next_date, Date_of_balance) - 1), '')) e as i, x --generate rows
order by account_name desc, Date_of_balance --this is to have order of rows like in your Converted Table

Result:

account_name    available_balance   date_of_balance 
Peter           50000                2021-05-24
Peter           50035                2021-05-25
Peter           50035                2021-05-26
Peter           50035                2021-05-27
Peter           50610                2021-05-28
Peter           50610                2021-05-29
Peter           50610                2021-05-30
Peter           50610                2021-05-31
Peter           51710                2021-06-01
Peter           53028.1              2021-06-02
Peter           53916.1              2021-06-03
Mary            50000                2021-05-24
Mary            50035                2021-05-25
Mary            50035                2021-05-26
Mary            50035                2021-05-27
Mary            50035                2021-05-28
Mary            50035                2021-05-29
Mary            53028.1              2021-05-30
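
To see the row-generation step on its own, here is a minimal sketch that expands a single hard-coded gap. The values start_date and n are hypothetical demo inputs, not columns of your table. Note that this sketch splits on a single space rather than on the empty pattern used in the query above; the intent is the same, to get an array of n elements whose positions serve as day offsets.

```sql
-- Row-generation step in isolation (hypothetical demo values).
-- space(n - 1) builds a string of n - 1 spaces; splitting it on a single
-- space gives n empty strings, and posexplode() emits one row per element
-- together with its 0-based position i, which is used as the day offset.
select t.start_date,
       t.n,
       e.i                          as day_offset,
       date_add(t.start_date, e.i)  as generated_date
from (
       -- 4 = datediff('2021-06-01', '2021-05-28'), Peter's largest gap
       select '2021-05-28' as start_date, 4 as n
     ) t
lateral view posexplode(split(space(t.n - 1), ' ')) e as i, x;
```

Under these assumptions the query should return four rows, with generated_date running from 2021-05-28 to 2021-05-31, which matches the four Peter rows carrying the balance 50610 in the result above.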