BigQuery - Calculate start and end date of back-to-back leave
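I have a problem I need some advice on: I need to calculate the calendar days of leave taken back to back in BigQuery. (For example, 2 leave records from 07-01-2020 to 10-01-2020 and from 13-01-2020 to 15-01-2020 should return 07-01-2020 to 15-01-2020.)
However, in some weeks the leave records are 3 or 4 days apart because there is a public holiday in that week. Can anyone advise on a solution? I created a table for public holidays, but I am still stuck on how to treat weeks with a public holiday as back to back. I have thought about window functions, but I'm not sure what the right logic is.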
Original dataset
personnel_number | start_date | end_date | next_start_date | next_end_date | days_between_next_row | remarks |
---|---|---|---|---|---|---|
100100 | 16/1/2020 | 17/1/2020 | 20/1/2020 | 24/1/2020 | 3 | |
100100 | 20/1/2020 | 24/1/2020 | 28/1/2020 | 31/1/2020 | 4 | "public holiday on 27-Jan" |
100100 | 28/1/2020 | 31/1/2020 | 10/2/2020 | 13/2/2020 | 10 | |
100100 | 10/2/2020 | 13/2/2020 | NULL | NULL | | |
Public holiday table
pub_start_date | pub_end_date | remarks |
---|---|---|
25/1/2020 | 27/1/2020 | "CNY Holiday" |
Desired result
personnel_number | start_date | back_to_back_end_date |
---|---|---|
100100 | 16/1/2020 | 31/1/2020 |
100100 | 10/2/2020 | 13/2/2020 |
Below is for BigQuery Standard SQL
#standardSQL
with temp as (
  -- all pto days from original table
  select personnel_number, day, '1' type from `project.dataset.table`,
  unnest(generate_date_array(start_date, end_date)) day
  union distinct -- add weekend days if last pto day is friday
  select personnel_number, day, '0' type from `project.dataset.table`,
  unnest([] || if(extract(dayofweek from end_date) = 6, [end_date + 1, end_date + 2], [])) day
  union distinct -- all holiday days from holidays table
  select personnel_number, day, '0' from (select distinct personnel_number from `project.dataset.table`),
  (select day from holidays, unnest(generate_date_array(pub_start_date, pub_end_date)) day)
  union distinct -- add weekend days to holidays if last day of holiday is friday
  select personnel_number, day, '0' from (select distinct personnel_number from `project.dataset.table`),
  (select day from holidays, unnest([] || if(extract(dayofweek from pub_end_date) = 6, [pub_end_date + 1, pub_end_date + 2], [])) day)
)
select personnel_number,
  start_date + start_tail as start_date, -- removing leading non pto days
  back_to_back_end_date - end_tail as back_to_back_end_date -- removing trailing non pto days
from (
  select personnel_number,
    min(day) start_date,
    max(day) back_to_back_end_date,
    length(regexp_extract(string_agg(type, '' order by day), r'^0*')) start_tail, -- detect number of leading non pto days (holidays or weekend days)
    length(regexp_extract(string_agg(type, '' order by day), r'0*$')) end_tail, -- detect number of trailing non pto days (holidays or weekend days)
    regexp_contains(string_agg(type, '' order by day), r'1') valid -- keep only groups that contain at least one pto day
  from (
    -- running count of flags gives each run of consecutive days its own group number
    select personnel_number, day, type, countif(flag) over(partition by personnel_number order by day) grp
    from (
      -- flag is true when a day does not directly follow the previous day (start of a new run)
      select *, day != 1 + ifnull(lag(day) over(partition by personnel_number order by day), day) flag
      from temp
    )
  )
  group by personnel_number, grp
)
where valid
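To see how the grouping and trimming steps in the query above fit together, here is a minimal sketch on a hypothetical five-row toy input rather than the real tables. The CTE names (toy, flagged, grouped, islands) and the columns first_day, last_day and pattern are illustrative assumptions and not part of the answer; date_add and date_sub stand in for the answer's date + integer arithmetic.

```sql
#standardSQL
with toy as (
  -- hypothetical input: '1' = leave day, '0' = weekend or holiday filler
  select date '2020-01-02' day, '1' type union all  -- Thursday, leave
  select date '2020-01-03', '1' union all           -- Friday, leave
  select date '2020-01-04', '0' union all           -- Saturday, filler
  select date '2020-01-05', '0' union all           -- Sunday, filler
  select date '2020-01-08', '1'                     -- Wednesday, leave
), flagged as (
  -- flag is true when the previous row is not exactly one day earlier;
  -- the first row has no previous row, so it is compared with itself + 1 day
  select day, type,
    day != date_add(ifnull(lag(day) over(order by day), day), interval 1 day) flag
  from toy
), grouped as (
  -- the running count of flags numbers each run of consecutive days
  select day, type, countif(flag) over(order by day) grp
  from flagged
), islands as (
  -- one row per run, with its day types concatenated in date order
  select grp, min(day) first_day, max(day) last_day,
    string_agg(type, '' order by day) pattern
  from grouped
  group by grp
)
select grp, pattern,
  -- skip leading filler days
  date_add(first_day, interval length(regexp_extract(pattern, r'^0*')) day) start_date,
  -- drop trailing filler days
  date_sub(last_day, interval length(regexp_extract(pattern, r'0*$')) day) back_to_back_end_date
from islands
where regexp_contains(pattern, r'1')  -- discard runs made only of filler days
order by grp
```

Under these assumptions the query should return two rows: group 1 with pattern 1100, trimmed to 2020-01-02 through 2020-01-03, and group 2 with pattern 1, covering only 2020-01-08. The weekend filler glues adjacent leave records into one run, and the regular expressions then strip any filler left at either end.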
If applied to the sample data from your question
with `project.dataset.table` as (
  select 100100 personnel_number, date '2020-01-16' start_date, date '2020-01-17' end_date union all
  select 100100, '2020-01-20', '2020-01-24' union all
  select 100100, '2020-01-28', '2020-01-31' union all
  select 100101, '2020-02-10', '2020-02-13'
), holidays as (
  select date '2020-01-25' pub_start_date, date '2020-01-27' pub_end_date, 'CNY Holiday' remarks
)
the output is