SQL: How to extract different data from two columns
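I have an EMPLOYEE view with a report date; the view has captured employee information every day for many years.
As a beginner, I wrote this SQL (SAP HANA), but it turns out to be very, very slow and resource-hungry: it works, but it is unusable.
I want to capture, between two report dates, the last report date for every active employee and, for every withdrawn employee whose end date is on or after the first report date, the first date on which they were reported as withdrawn.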
select E.EMPLOYEEID
, E.STATUS
, E.STARTDATE
, E.ENDDATE
, E.REPORTDATE
FROM
(
SELECT EMPLOYEEID, Max(REPORTDATE) as MaxDate
FROM "EMPLOYEETABLE"
WHERE REPORTDATE >= DATE'2019-04-01' AND REPORTDATE <= DATE'2019-07-01' AND STATUS = 'Active'
GROUP BY EMPLOYEEID
) r
LEFT JOIN "EMPLOYEETABLE" E
ON E.EMPLOYEEID = r.EMPLOYEEID AND E.REPORTDATE = r.MaxDate
UNION ALL
select E.EMPLOYEEID
, E.STATUS
, E.STARTDATE
, E.ENDDATE
, E.REPORTDATE
FROM
(
SELECT EMPLOYEEID, Min(REPORTDATE) as MinDate
FROM "EMPLOYEETABLE"
WHERE REPORTDATE >= DATE'2019-04-01' AND REPORTDATE <= DATE'2019-07-01' AND STATUS = 'Withdrawn' AND ENDDATE >= DATE'2019-04-01'
GROUP BY EMPLOYEEID
) w
LEFT JOIN "EMPLOYEETABLE" E
ON E.EMPLOYEEID = w.EMPLOYEEID AND E.REPORTDATE = w.MinDate
Can anyone help me write this more efficiently, please?
Sample data set
Reportdate   EmployeeID   startdate    enddate      status
01/04/2019   Steve        12/02/2012   Null         Active
01/04/2019   Don          15/06/2016   Null         Active
01/04/2019   John         14/03/2015   01/04/2019   Withdrawn
01/04/2019   Anna         12/05/2017   Null         Active
02/04/2019   Steve        12/02/2012   Null         Active
02/04/2019   Don          15/06/2016   Null         Active
02/04/2019   John         14/03/2015   01/04/2019   Withdrawn
02/04/2019   Anna         12/05/2017   Null         Active
03/04/2019   Steve        12/02/2012   Null         Active
03/04/2019   Don          15/06/2016   Null         Active
03/04/2019   John         14/03/2015   01/04/2019   Withdrawn
03/04/2019   Anna         12/05/2017   03/04/2019   Withdrawn
Desired output
Reportdate   EmployeeID   startdate    enddate      status
03/04/2019   Steve        12/02/2012   Null         Active
03/04/2019   Don          15/06/2016   Null         Active
01/04/2019   John         14/03/2015   01/04/2019   Withdrawn
03/04/2019   Anna         12/05/2017   03/04/2019   Withdrawn
Obviously my data set is very large, because the same employee appears every day and new employees keep being added.
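I am assuming that you can create temporary tables, and I am providing a solution in SQL Server.
I am also assuming that the following parts of your query are wrong:
The last line:
E.EMPLOYEEID = r.EMPLOYEEID
should be
E.EMPLOYEEID = w.EMPLOYEEID
In the second subquery:
AND ENDDATE >= DATE'2019-07-01'
should be
AND ENDDATE <= DATE'2019-07-01'
So, to divide and conquer, you should store the results of the subqueries in temporary tables, like this: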
IF OBJECT_ID('tempdb..#minreportdate') IS NOT NULL
DROP TABLE #minreportdate
SELECT EMPLOYEEID, Max(REPORTDATE) as MaxDate
INTO #minreportdate
FROM EMPLOYEETABLE
WHERE REPORTDATE BETWEEN '2019-04-01' AND '2019-07-01'
AND STATUS = 'Active'
GROUP BY EMPLOYEEID
and
IF OBJECT_ID('tempdb..#maxreportdate') IS NOT NULL
DROP TABLE #maxreportdate
SELECT EMPLOYEEID, Min(REPORTDATE) as MinDate -- first withdrawn snapshot; the main query joins on w.MinDate
INTO #maxreportdate
FROM EMPLOYEETABLE
WHERE REPORTDATE BETWEEN '2019-04-01' AND '2019-07-01'
AND STATUS = 'Withdrawn'
GROUP BY EMPLOYEEID
Then run the main query, which reads from your two temporary tables:
SELECT E.EMPLOYEEID, E.STATUS, E.STARTDATE, E.ENDDATE, E.REPORTDATE
FROM #minreportdate r
LEFT JOIN "EMPLOYEETABLE" E
ON E. EMPLOYEEID = r.EMPLOYEEID
AND E.REPORTDATE = r.MaxDate
UNION ALL
SELECT E.EMPLOYEEID, E.STATUS, E.STARTDATE, E.ENDDATE, E.REPORTDATE
FROM #maxreportdate w
LEFT JOIN "EMPLOYEETABLE" E
ON E. EMPLOYEEID = w.EMPLOYEEID
AND E.REPORTDATE = w.MinDate
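As davidc2p showed, splitting the query into multiple units, each covering a specific aspect of the report/query logic, is a good idea.
But temporary tables are not needed for this; common table expressions (CTEs), a.k.a. the "WITH clause", are sufficient.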
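Query structure
The really important insight is that EMPLOYEETABLE is a snapshot table that captures the STATUS of every employee for each date.
For the query, in general, only the snapshots within a certain time range are considered.
Based on this "timeboxed" data set, the query now deals with employees (not snapshots!) and their most recent status (within the timeboxed data set).
This observation makes it easy to determine the MAX() STATUS for each employee (this works because 'Withdrawn' sorts after 'Active'). Because "WITHDRAWN" employees are subject to a special condition, namely that their respective ENDDATE must be on or after the start of the reporting time frame, the two employee groups/sets/cohorts each need their own subquery.
Both subqueries return, for each employee, the record (EMPLOYEEID + REPORTDATE serve as the unique key of a report record) that should be part of the query result.
To produce the output, the two employee groups are combined (UNION ALL) and then used as the final filter/selector for all records returned from the base data.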
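The cohort split described above can be tried out on the sample data from the question. This is only a minimal sketch: it assumes SQLite (through Python's `sqlite3` module) as a stand-in for SAP HANA, stores the dates as ISO strings so that they compare correctly as text, and uses a hypothetical helper name, `classify_employees`. It computes MAX(status) per employee inside the time box; any employee with at least one 'Withdrawn' snapshot ends up in the withdrawn cohort.

```python
import sqlite3

# Sample snapshots from the question, with dates rewritten as ISO strings
# (assumption: ISO text sorts chronologically in SQLite).
ROWS = [
    ("2019-04-01", "Steve", "2012-02-12", None, "Active"),
    ("2019-04-01", "Don", "2016-06-15", None, "Active"),
    ("2019-04-01", "John", "2015-03-14", "2019-04-01", "Withdrawn"),
    ("2019-04-01", "Anna", "2017-05-12", None, "Active"),
    ("2019-04-02", "Steve", "2012-02-12", None, "Active"),
    ("2019-04-02", "Don", "2016-06-15", None, "Active"),
    ("2019-04-02", "John", "2015-03-14", "2019-04-01", "Withdrawn"),
    ("2019-04-02", "Anna", "2017-05-12", None, "Active"),
    ("2019-04-03", "Steve", "2012-02-12", None, "Active"),
    ("2019-04-03", "Don", "2016-06-15", None, "Active"),
    ("2019-04-03", "John", "2015-03-14", "2019-04-01", "Withdrawn"),
    ("2019-04-03", "Anna", "2017-05-12", "2019-04-03", "Withdrawn"),
]


def classify_employees(conn, period_start, period_end):
    """Return {employeeid: MAX(status)} for snapshots inside the time box."""
    sql = """
        SELECT employeeid, MAX(status) AS max_status
        FROM employeetable
        WHERE reportdate >= ? AND reportdate <= ?
        GROUP BY employeeid
    """
    return dict(conn.execute(sql, (period_start, period_end)).fetchall())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employeetable ("
    "reportdate TEXT, employeeid TEXT, startdate TEXT, enddate TEXT, status TEXT)"
)
conn.executemany("INSERT INTO employeetable VALUES (?, ?, ?, ?, ?)", ROWS)

cohorts = classify_employees(conn, "2019-04-01", "2019-07-01")
print(cohorts)
```

Note that MAX(status) is a string maximum, so it identifies employees who were withdrawn at some point in the time box; an employee who was withdrawn and later reactivated within the same time box would still be classified as withdrawn.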
Rewritten query
with report_base as (
-- all records relevant to the reporting timeframe
select
reportdate, EmployeeID
, startdate, enddate, status
from
employeetable
where
reportdate >= date'2019-04-01'
and reportdate <= date'2019-07-01')
, active_employees as (
-- all employees with most recent status in reporting timeframe = ACTIVE
-- should be DISJOINT from WITHDRAWN_EMPLOYEES
select
employeeid
, max(reportdate) as reportdate
from
report_base
group by
employeeid
having
max(status)='Active')
, withdrawn_employees as (
-- all employees with most recent status in reporting timeframe = WITHDRAWN
-- the ENDDATE should be on or after the start of the reporting timeframe
-- should be DISJOINT from ACTIVE_EMPLOYEES
select
employeeid
, min(reportdate) as reportdate
from
report_base
where
enddate >= date'2019-04-01'
group by
employeeid
having
max(status)='Withdrawn')
, report_records as(
-- all records that should be returned
select
employeeid, reportdate
from
active_employees
union all
select
employeeid, reportdate
from
withdrawn_employees)
select
rb.reportdate, rb.EmployeeID
, rb.startdate, rb.enddate, rb.status
from
report_base rb
inner join report_records rr
on (rb.employeeid, rb.reportdate)
= (rr.employeeid, rr.reportdate);
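To sanity-check the CTE approach against the sample data and the desired output from the question, here is a small sketch. It is not a performance test and makes several assumptions: SQLite (through Python's `sqlite3` module) stands in for SAP HANA, dates are stored as ISO strings, the tuple comparison in the join is spelled out as two equality conditions, an ORDER BY is added only to make the result deterministic, and the names `REPORT_SQL` and `run_report` are hypothetical.

```python
import sqlite3

# Sample snapshots from the question, with dates rewritten as ISO strings.
ROWS = [
    ("2019-04-01", "Steve", "2012-02-12", None, "Active"),
    ("2019-04-01", "Don", "2016-06-15", None, "Active"),
    ("2019-04-01", "John", "2015-03-14", "2019-04-01", "Withdrawn"),
    ("2019-04-01", "Anna", "2017-05-12", None, "Active"),
    ("2019-04-02", "Steve", "2012-02-12", None, "Active"),
    ("2019-04-02", "Don", "2016-06-15", None, "Active"),
    ("2019-04-02", "John", "2015-03-14", "2019-04-01", "Withdrawn"),
    ("2019-04-02", "Anna", "2017-05-12", None, "Active"),
    ("2019-04-03", "Steve", "2012-02-12", None, "Active"),
    ("2019-04-03", "Don", "2016-06-15", None, "Active"),
    ("2019-04-03", "John", "2015-03-14", "2019-04-01", "Withdrawn"),
    ("2019-04-03", "Anna", "2017-05-12", "2019-04-03", "Withdrawn"),
]

# Same structure as the rewritten query, parameterised on the time box.
REPORT_SQL = """
WITH report_base AS (
    SELECT reportdate, employeeid, startdate, enddate, status
    FROM employeetable
    WHERE reportdate >= :period_start AND reportdate <= :period_end
),
active_employees AS (
    SELECT employeeid, MAX(reportdate) AS reportdate
    FROM report_base
    GROUP BY employeeid
    HAVING MAX(status) = 'Active'
),
withdrawn_employees AS (
    SELECT employeeid, MIN(reportdate) AS reportdate
    FROM report_base
    WHERE enddate >= :period_start
    GROUP BY employeeid
    HAVING MAX(status) = 'Withdrawn'
),
report_records AS (
    SELECT employeeid, reportdate FROM active_employees
    UNION ALL
    SELECT employeeid, reportdate FROM withdrawn_employees
)
SELECT rb.reportdate, rb.employeeid, rb.startdate, rb.enddate, rb.status
FROM report_base rb
INNER JOIN report_records rr
    ON rb.employeeid = rr.employeeid
   AND rb.reportdate = rr.reportdate
ORDER BY rb.employeeid
"""


def run_report(conn, period_start, period_end):
    """Run the CTE query for one reporting time box and return all rows."""
    params = {"period_start": period_start, "period_end": period_end}
    return conn.execute(REPORT_SQL, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employeetable ("
    "reportdate TEXT, employeeid TEXT, startdate TEXT, enddate TEXT, status TEXT)"
)
conn.executemany("INSERT INTO employeetable VALUES (?, ?, ?, ?, ?)", ROWS)

result = run_report(conn, "2019-04-01", "2019-07-01")
for row in result:
    print(row)
```

On the sample data this returns one row per employee: the latest snapshot for Steve and Don, and the first withdrawn snapshot for John and Anna.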
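Why is this better?
Since no volume test data is available, I could not check the actual runtime performance difference between the OP's query and the refactored version.
However, the refactored version results in one less join in the EXPLAIN PLAN, which may translate into performance and memory usage improvements.
Beyond that, the refactored query is much clearer about how the result data is computed and allows for step-by-step development/debugging.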