Parsing a string in both PostgreSQL and Microsoft SQL Server
I have the following data (all table DDL and data DML are available in the fiddles here (SQL Server) and here (PostgreSQL)):
I already have solutions; this question is about efficiency. What is the best way to do this?
CREATE TABLE ticket
(
ticket_id INTEGER NOT NULL,
working_time VARCHAR (30) NULL DEFAULT NULL CHECK (working_time != '')
);
And the data:
ticket_id working_time
18 20.02.2021,15:00,17:00
18 20.02.2021,15:00,17:00
18 20.02.2021,15:00,17:00
20 20.02.2021,12:00,14:15
20 _rubbish__ -- <--- deliberate
20 20.02.2021,12:00,14:15
20
20 21.02.2021,12:00,14:15
20 _rubbish__
20 21.02.2021,12:00,14:15
20
11 rows
The _rubbish__ entries in the data are deliberate - the column is free text, and I have to be able to cope with bad data!
Now, I want a result like this:
Ticket ID The date hrs_worked_per_ticket
18 2021-02-20 06:00:00
20 2021-02-20 04:30:00
20 2021-02-21 04:30:00
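No need to tell me that the schema is shocking - I find the idea of storing a date (in a non-ISO format) and times together in a single string like this repugnant! I have no choice in the matter.
I have my own answers for both PostgreSQL and SQL Server (see below), but I would like to know whether there is a more efficient approach.
I have a SQL Server solution here (it's horrible!):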
WITH cte AS
(
SELECT
ticket_id,
CAST
(
TRY_CONVERT
(
DATE,
SUBSTRING(working_time, 7, 4) + '.' +
SUBSTRING(working_time, 4, 2) + '.' +
SUBSTRING(working_time, 1, 2)
) AS DATETIME
)
+
CAST
(
CAST
(
SUBSTRING
(
working_time, 12, 5
) AS TIME
) AS DATETIME
) AS st_dt,
CAST
(
TRY_CONVERT
(
DATE,
SUBSTRING(working_time, 7, 4) + '.' +
SUBSTRING(working_time, 4, 2) + '.' +
SUBSTRING(working_time, 1, 2)
) AS DATETIME
)
+
CAST
(
CAST
(
SUBSTRING
(
working_time, 18, 5
) AS TIME
) AS DATETIME
) AS et_dt
FROM
ticket
)
SELECT
ticket_id AS "Ticket ID",
TRY_CONVERT(date, et_dt) AS "The date",
TRY_CONVERT
(
VARCHAR(8),
dateadd
(
second,
COALESCE(SUM
(
DATEDIFF(SECOND, st_dt, et_dt)
), 0),
0
),
108
) AS hrs_worked_per_ticket
FROM
cte
WHERE TRY_CONVERT(DATE, et_dt) IS NOT NULL
GROUP BY ticket_id, TRY_CONVERT(DATE, et_dt)
ORDER BY ticket_id, TRY_CONVERT(DATE, et_dt);
Result:
Ticket ID The date hrs_worked_per_ticket
18 2021-02-20 06:00:00
20 2021-02-20 04:30:00
20 2021-02-21 04:30:00
I have a PostgreSQL solution here. try_cast_time and try_cast_date are functions that I wrote, inspired by this post (the whole thread is helpful!):
SELECT DISTINCT
ticket_id,
try_cast_date(working_time)::DATE,
SUM((try_cast_date(working_time) + try_cast_time(working_time, 18, 5)) -
(try_cast_date(working_time) + try_cast_time(working_time, 12, 5)))
OVER (PARTITION BY ticket_id, try_cast_date(working_time)::DATE)
AS ts_diff
FROM ticket
WHERE try_cast_date(working_time)::DATE IS NOT NULL
ORDER BY ticket_id, try_cast_date(working_time)::DATE
Result:
ticket_id try_cast_date ts_diff
18 2021-02-20 06:00:00
20 2021-02-20 04:30:00
20 2021-02-21 04:30:00
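The bodies of try_cast_date and try_cast_time are not reproduced in this post. Below is a minimal PL/pgSQL sketch of what such helpers might look like, assuming they return a DATE and a TIME respectively and yield NULL for anything that cannot be parsed. The signatures are inferred from the calls in the query above; the actual definitions in the fiddle may differ.

```sql
-- Hypothetical helper: parse the leading DD.MM.YYYY of the string.
-- Returns NULL instead of raising an error on NULL or malformed input.
CREATE OR REPLACE FUNCTION try_cast_date(p_in TEXT)
RETURNS DATE
LANGUAGE plpgsql
STABLE
AS $$
BEGIN
    RETURN to_date(substring(p_in FROM 1 FOR 10), 'DD.MM.YYYY');
EXCEPTION
    WHEN OTHERS THEN
        RETURN NULL;   -- e.g. '_rubbish__' ends up here
END;
$$;

-- Hypothetical helper: cast p_len characters starting at p_start to TIME.
-- Returns NULL instead of raising an error on NULL or malformed input.
CREATE OR REPLACE FUNCTION try_cast_time(p_in TEXT, p_start INTEGER, p_len INTEGER)
RETURNS TIME
LANGUAGE plpgsql
STABLE
AS $$
BEGIN
    RETURN substring(p_in FROM p_start FOR p_len)::TIME;
EXCEPTION
    WHEN OTHERS THEN
        RETURN NULL;   -- empty or non-time substrings end up here
END;
$$;
```

With these definitions, date + time yields a timestamp, and subtracting two timestamps yields the interval that the query sums. Note that an EXCEPTION block has a per-call cost in PL/pgSQL, which is relevant when the concern is efficiency.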
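So you have a working version; it is unwieldy, but it won't necessarily perform badly just because it is verbose.
But you asked whether there is a more efficient approach. For SQL Server (I can't comment on Postgres), you can greatly simplify the query and improve performance by adding persisted computed columns and a supporting index on the date.
This removes the non-sargability of the query and lets the optimizer make full use of the index for filtering and aggregation. It also avoids the (minimal) overhead of parsing and converting the string values at query time, because that work is now done once, when a row is inserted or updated.
Add the computed columns: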
alter table ticket add WorkingDate as
  Try_convert(date,
    Concat(Substring(working_time, 7, 4), Substring(working_time, 4, 2), Substring(working_time, 1, 2)),
    112) persisted;
alter table ticket add WorkingDuration as
  DateDiff(minute,
    Try_convert(time, Substring(working_time, 12, 5), 114),
    Try_convert(time, Substring(working_time, 18, 5), 114)) persisted;
Add a supporting index:
create clustered index Ix_Id_WorkingDuration on ticket(ticket_id,workingdate)
Your query then becomes:
with w as (
select ticket_Id, workingDate, Sum(workingDuration) d
from ticket
group by ticket_id, workingDate
)
select ticket_id,
workingdate as [The date],
format(d / 60 * 100 + d % 60, '#:0#') hrs_worked_per_ticket
from w
where d>0;
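The query above formats the total with a custom format() pattern. If the hh:mm:ss layout from the question is wanted instead, one option is to reuse the DATEADD / style 108 conversion from the original query on the minute total. This is a sketch only, assuming the WorkingDate and WorkingDuration computed columns defined above and that no ticket accumulates 24 hours or more on a single date:

```sql
WITH w AS (
    SELECT ticket_id, WorkingDate, SUM(WorkingDuration) AS d   -- d is total minutes
    FROM   ticket
    GROUP  BY ticket_id, WorkingDate
)
SELECT ticket_id,
       WorkingDate AS [The date],
       -- add the minutes to the base datetime (0 = 1900-01-01 00:00:00),
       -- then keep only the hh:mm:ss part via style 108
       CONVERT(varchar(8), DATEADD(minute, d, 0), 108) AS hrs_worked_per_ticket
FROM   w
WHERE  d > 0;
```

Because style 108 keeps only the time portion, a total of 24 hours or more would wrap around; such totals would need to be formatted from the minute count directly.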
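On so few rows this will not produce any noticeable improvement over your original query, but it should perform considerably better on large data sets, especially if you need to filter further by date or date range.
Even so, the estimated execution plans put this version at 18% of the relative cost, against 82% for the original.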
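To illustrate the point about further filtering, here is a sketch of a query restricted to one ticket and a date range. It assumes the WorkingDate and WorkingDuration computed columns and the clustered index on (ticket_id, workingdate) defined earlier in this answer; the ticket number and dates are example values only.

```sql
SELECT ticket_id,
       WorkingDate          AS [The date],
       SUM(WorkingDuration) AS minutes_worked
FROM   ticket
WHERE  ticket_id = 20                  -- leading index column
  AND  WorkingDate >= '20210220'       -- half-open date range on the second index column
  AND  WorkingDate <  '20210222'
GROUP  BY ticket_id, WorkingDate
HAVING SUM(WorkingDuration) > 0;       -- drops groups with no parseable times
```

Because the predicates are on plain columns rather than on expressions over working_time, the optimizer should be able to seek on the index instead of parsing every row.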