BigQuery - Calculate start and end date back to back

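I have a problem and need some advice: I need to calculate the calendar days of back-to-back leave in BigQuery. (For example, two leave records, 07-01-2020 to 10-01-2020 and 13-01-2020 to 15-01-2020, should return 07-01-2020 to 15-01-2020.)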

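However, in some weeks the gap between leave records is 3 or 4 days because there is a public holiday in that week. Can anyone suggest a way to solve this? I created a table for public holidays, but I am still stuck on how to treat weeks with a public holiday as back to back. I considered window functions, but I am not sure what the right logic is.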

Original dataset

personnel_number  start_date  end_date   next_start_date  next_end_date  days_between_next_row  remarks
100100            16/1/2020   17/1/2020  20/1/2020        24/1/2020      3
100100            20/1/2020   24/1/2020  28/1/2020        31/1/2020      4                      "public holiday on 27-Jan"
100100            28/1/2020   31/1/2020  10/2/2020        13/2/2020      10
100100            10/2/2020   13/2/2020  NULL             NULL

Public holiday table

pub_start_date  pub_end_date  remarks
25/1/2020       27/1/2020     "CNY Holiday"

Desired result

personnel_number  start_date  back_to_back_end_date
100100            16/1/2020   31/1/2020
100100            10/2/2020   13/2/2020

The following is for BigQuery Standard SQL:

#standardSQL
with temp as (
  -- all pto days from original table
  select personnel_number, day, '1' type from `project.dataset.table`, 
  unnest(generate_date_array(start_date, end_date)) day
  
  union distinct -- add weekend days if last pto day is friday
  select personnel_number, day, '0' type from `project.dataset.table`, 
  unnest([] || if(extract(dayofweek from end_date) = 6, [end_date + 1, end_date + 2], [])) day
  
  union distinct -- all holiday days from holidays table 
  select personnel_number, day, '0' from (select distinct personnel_number from `project.dataset.table`), 
  (select day from holidays, unnest(generate_date_array(pub_start_date, pub_end_date)) day)
  
  union distinct -- add weekend days to holidays if the last holiday day is a Friday
  select personnel_number, day, '0' from (select distinct personnel_number from `project.dataset.table`), 
  (select day from holidays, unnest([] || if(extract(dayofweek from pub_end_date) = 6, [pub_end_date + 1, pub_end_date + 2], [])) day) 
)
select personnel_number,
  start_date + start_tail as start_date,                     -- removing leading non pto days
  back_to_back_end_date - end_tail as back_to_back_end_date  -- removing trailing non pto days
from (
  select personnel_number, 
    min(day) start_date, 
    max(day) back_to_back_end_date, 
    length(regexp_extract(string_agg(type, '' order by day), r'^0*')) start_tail, -- detect number of leading non pto days (holidays or weekend days)
    length(regexp_extract(string_agg(type, '' order by day), r'0*$')) end_tail,   -- detect number of trailing non pto days (holidays or weekend days)
    regexp_contains(string_agg(type, '' order by day), r'1') valid
  from (
    select personnel_number, day, type, countif(flag) over(partition by personnel_number order by day) grp
    from (
      select *, day != 1 + ifnull(lag(day) over(partition by personnel_number order by day), day) flag 
      from temp
    )
  )
  group by personnel_number, grp
)
where valid
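
The grouping and trimming steps of this query can be tried in isolation. Below is a minimal sketch, assuming hypothetical inline data in a CTE named `days` with columns `d` and `type` (these names and dates are made up for illustration and are not from the question). It flags a new run whenever a day does not directly follow the previous one, numbers the runs with a running count, and then uses the aggregated '1'/'0' string to trim filler days from both ends.

```sql
#standardSQL
-- Hypothetical inline data: type '1' is a leave day, '0' a weekend or holiday filler day.
with days as (
  select d, type
  from unnest([
    struct(date '2020-01-16' as d, '1' as type),
    (date '2020-01-17', '1'),
    (date '2020-01-18', '0'),   -- Saturday filler
    (date '2020-01-19', '0'),   -- Sunday filler
    (date '2020-01-20', '1'),
    (date '2020-01-21', '0'),   -- made-up holiday after the last leave day
    (date '2020-02-01', '0'),   -- filler-only run, contains no leave at all
    (date '2020-02-02', '0'),
    (date '2020-02-10', '1'),
    (date '2020-02-11', '1')
  ])
)
select
  -- move the start forward past leading filler days
  date_add(first_day, interval length(regexp_extract(types, r'^0*')) day) as start_date,
  -- move the end back past trailing filler days
  date_sub(last_day, interval length(regexp_extract(types, r'0*$')) day) as back_to_back_end_date
from (
  select min(d) first_day, max(d) last_day,
    string_agg(type, '' order by d) types   -- one character per day in the run
  from (
    -- running count of flags gives every consecutive run its own number
    select d, type, countif(flag) over(order by d) grp
    from (
      -- flag is true for the first row and whenever the previous day is not exactly one day earlier
      select d, type, ifnull(date_diff(d, lag(d) over(order by d), day) != 1, true) flag
      from days
    )
  )
  group by grp
)
where regexp_contains(types, r'1')   -- keep only runs that contain at least one leave day
order by start_date
```

With this inline data the first run aggregates to the string '110010', so one trailing day is trimmed and the row should read 2020-01-16 to 2020-01-20. The '00' run is dropped by the filter, and the last run should come back unchanged as 2020-02-10 to 2020-02-11.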

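To apply the `temp` query to the sample data from your question, prepend these CTEs and change `with temp as (` to `, temp as (`: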

with `project.dataset.table` as (
  select 100100 personnel_number, date '2020-01-16' start_date, date '2020-01-17' end_date union all
  select 100100, '2020-01-20', '2020-01-24' union all
  select 100100, '2020-01-28', '2020-01-31' union all
  select 100100, '2020-02-10', '2020-02-13'
), holidays as (
  select date '2020-01-25' pub_start_date, date '2020-01-27' pub_end_date, 'CNY Holiday' remarks 
)    

The output is: