Parsing a string in both PostgreSQL and Microsoft SQL Server

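I have the following data (all table DDL and data DML are available on the fiddle here (SQL Server) and here (PostgreSQL)).

I already have working solutions; this question is about efficiency. What is the best way to do this?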

CREATE TABLE ticket
(
  ticket_id INTEGER NOT NULL,
  working_time VARCHAR (30) NULL DEFAULT NULL CHECK (working_time != '')
);

And the data:

ticket_id   working_time
       18   20.02.2021,15:00,17:00
       18   20.02.2021,15:00,17:00
       18   20.02.2021,15:00,17:00
       20   20.02.2021,12:00,14:15
       20   _rubbish__              --  <---   deliberate
       20   20.02.2021,12:00,14:15  
       20   
       20   21.02.2021,12:00,14:15
       20   _rubbish__
       20   21.02.2021,12:00,14:15
       20   
11 rows

The _rubbish__ entries in the data are deliberate - the column is free text, and I have to be able to handle bad data!

Now, I want a result like this:

Ticket ID     The date  hrs_worked_per_ticket
       18   2021-02-20               06:00:00
       20   2021-02-20               04:30:00
       20   2021-02-21               04:30:00

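There's no need to tell me that the schema is shocking - I find the idea of storing a date (in a non-ISO format) and times like this in a single column abhorrent! I have no choice in the matter.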

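I have my own answers for both PostgreSQL and SQL Server (see below), but I'd like to know whether there is a more efficient way.

I have a SQL Server solution here (it's horrible!):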

WITH cte AS
(
  SELECT
    ticket_id,
    CAST
    (
      TRY_CONVERT
      (
        DATE, 
        SUBSTRING(working_time, 7, 4) + '.' + 
        SUBSTRING(working_time, 4, 2) + '.' +
        SUBSTRING(working_time, 1, 2)
      ) AS DATETIME
    ) 
    + 
    CAST
    (
      CAST
      (
        SUBSTRING
        (
          working_time, 12,  5
        ) AS TIME
      ) AS DATETIME
    ) AS st_dt,
    CAST
    (
      TRY_CONVERT
      (
        DATE, 
        SUBSTRING(working_time, 7, 4) + '.' + 
        SUBSTRING(working_time, 4, 2) + '.' +
        SUBSTRING(working_time, 1, 2)
      ) AS DATETIME
    ) 
    + 
    CAST
    (
      CAST
      (
        SUBSTRING
        (
          working_time, 18,  5
        ) AS TIME
      ) AS DATETIME
    ) AS et_dt
  FROM
    ticket
)
SELECT
  ticket_id AS "Ticket ID",
  TRY_CONVERT(date, et_dt) AS "The date",
  TRY_CONVERT
  (
    VARCHAR(8), 
    dateadd
    (
      second, 
      COALESCE(SUM
      (
        DATEDIFF(SECOND, st_dt, et_dt)
      ), 0), 
      0
    ),  
    108
  ) AS hrs_worked_per_ticket
FROM 
  cte
WHERE TRY_CONVERT(DATE, et_dt) IS NOT NULL
GROUP BY ticket_id, TRY_CONVERT(DATE, et_dt)
ORDER BY ticket_id, TRY_CONVERT(DATE, et_dt);

Result:

Ticket ID   The date    hrs_worked_per_ticket
       18   2021-02-20               06:00:00
       20   2021-02-20               04:30:00
       20   2021-02-21               04:30:00

I have a PostgreSQL solution here - try_cast_time and try_cast_date are functions that I wrote, inspired by this post (the whole thread is helpful!):

SELECT DISTINCT
  ticket_id,
  try_cast_date(working_time)::DATE,
  SUM((try_cast_date(working_time) + try_cast_time(working_time, 18, 5)) - 
  (try_cast_date(working_time) + try_cast_time(working_time, 12, 5))) 
    OVER (PARTITION BY ticket_id, try_cast_date(working_time)::DATE)
  AS ts_diff
FROM ticket
WHERE try_cast_date(working_time)::DATE IS NOT NULL
ORDER BY ticket_id, try_cast_date(working_time)::DATE

Result:

ticket_id   try_cast_date    ts_diff
       18      2021-02-20   06:00:00
       20      2021-02-20   04:30:00
       20      2021-02-21   04:30:00
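
The definitions of try_cast_date and try_cast_time are not shown in the post, so the following is only a hypothetical reconstruction of what such helpers might look like in PL/pgSQL. The parameter names, the regular-expression pre-check and the STABLE marking are all assumptions; the real functions may differ.

```sql
-- Hypothetical reconstruction: the post does not show these definitions.
-- Returns the leading DD.MM.YYYY part as a DATE, or NULL if it cannot be parsed.
CREATE OR REPLACE FUNCTION try_cast_date(p_in TEXT)
RETURNS DATE
LANGUAGE plpgsql
STABLE
AS $$
BEGIN
  -- Reject anything that does not start with dd.mm.yyyy followed by a comma.
  IF p_in !~ '^\d{2}\.\d{2}\.\d{4},' THEN
    RETURN NULL;
  END IF;
  RETURN to_date(left(p_in, 10), 'DD.MM.YYYY');
EXCEPTION
  WHEN others THEN
    RETURN NULL;  -- e.g. an impossible date that to_date rejects
END;
$$;

-- Returns the p_len characters starting at p_start as a TIME, or NULL on failure.
CREATE OR REPLACE FUNCTION try_cast_time(p_in TEXT, p_start INTEGER, p_len INTEGER)
RETURNS TIME
LANGUAGE plpgsql
STABLE
AS $$
BEGIN
  RETURN substring(p_in FROM p_start FOR p_len)::TIME;
EXCEPTION
  WHEN others THEN
    RETURN NULL;  -- e.g. an empty or non-time substring
END;
$$;

-- Quick check against one good value and one rubbish value.
SELECT try_cast_date('20.02.2021,15:00,17:00')        AS d,
       try_cast_time('20.02.2021,15:00,17:00', 12, 5) AS start_time,
       try_cast_date('_rubbish__')                    AS bad;
```

One point relevant to the efficiency question: a PL/pgSQL block with an EXCEPTION clause is noticeably more expensive to enter and exit than one without, so calling helpers like these several times per row (as the query above does) may cost more than the verbosity alone suggests.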

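So you have a working version and, although it's clunky, it won't necessarily perform badly just because it's verbose.

But you ask whether there is a more efficient way. For SQL Server (I can't comment on Postgres), you can greatly simplify the query and improve performance by adding persisted computed columns and a supporting index on the date.

This removes the non-sargability from the query and allows the optimizer to make full use of the index for filtering and aggregating. It also avoids the (minimal) overhead of parsing and converting the string values at query time, since that now happens only when a row is inserted/updated.

Add the computed columns: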

alter table ticket add WorkingDate as
  Try_convert(date, Concat(Substring(working_time, 7, 4), Substring(working_time, 4, 2), Substring(working_time, 1, 2)), 112) persisted;
alter table ticket add WorkingDuration as
  DateDiff(minute, Try_convert(time, Substring(working_time, 12, 5), 114), Try_convert(time, Substring(working_time, 18, 5), 114)) persisted;
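
Before altering the table, it may be worth checking what these two expressions produce for good, rubbish and NULL inputs. The query below is a minimal sketch that evaluates the same expressions against literal sample values; the derived table alias v and the three sample rows are assumptions chosen to mirror the question's data.

```sql
-- Sketch: evaluate the two computed-column expressions against literal samples.
SELECT
  v.working_time,
  TRY_CONVERT(date,
    CONCAT(SUBSTRING(v.working_time, 7, 4),
           SUBSTRING(v.working_time, 4, 2),
           SUBSTRING(v.working_time, 1, 2)), 112) AS WorkingDate,
  DATEDIFF(minute,
    TRY_CONVERT(time, SUBSTRING(v.working_time, 12, 5), 114),
    TRY_CONVERT(time, SUBSTRING(v.working_time, 18, 5), 114)) AS WorkingDuration
FROM (VALUES
  ('20.02.2021,15:00,17:00'),     -- well-formed row
  ('_rubbish__'),                 -- deliberate bad data
  (CAST(NULL AS varchar(30)))     -- NULL working_time
) AS v(working_time);
```

Rows that fail to parse do not raise an error here, so any query built on these columns still has to exclude them explicitly.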

Add a supporting index:

create clustered index Ix_Id_WorkingDuration on ticket(ticket_id,workingdate)

Your query then becomes:

with w as (
    select ticket_Id, workingDate, Sum(workingDuration) d
    from ticket
    group by ticket_id, workingDate
)
select ticket_id,
  workingdate as [The date],
  format(d / 60 * 100 + d % 60, '#:0#') hrs_worked_per_ticket
from w
where d>0;

See amended Fiddle
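
The query above returns the duration in an h:mm style, whereas the question asks for hh:mm:ss. If that exact format is needed, one option is to reuse the DATEADD plus style 108 approach from the question's own query. This is a sketch that assumes the WorkingDate and WorkingDuration computed columns already exist, and that no single ticket/date total reaches 24 hours (the time-of-day portion would wrap).

```sql
-- Sketch: render the minute total as hh:mm:ss via style 108.
-- Only valid for totals under 24 hours, since the time-of-day wraps.
WITH w AS (
    SELECT ticket_id, WorkingDate, SUM(WorkingDuration) AS d
    FROM ticket
    GROUP BY ticket_id, WorkingDate
)
SELECT ticket_id AS "Ticket ID",
  WorkingDate AS "The date",
  CONVERT(varchar(8), DATEADD(minute, d, 0), 108) AS hrs_worked_per_ticket
FROM w
WHERE d > 0
ORDER BY ticket_id, WorkingDate;
```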

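Compared with your original query, the computed-column version won't produce any noticeable improvement on so few rows, but it should perform noticeably better on a large data set, particularly if you need to filter further by date or date range.

However, the estimated execution plan puts this version at 18% of the relative cost, versus 82% for the original.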
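
As an illustration of the date-range case mentioned above, a filtered query might look like the sketch below. The ticket id and the date bounds are made-up example values. Because the clustered index is keyed on (ticket_id, WorkingDate), an index seek is most likely when ticket_id is constrained as well; that is an expectation about the optimizer, not something measured here.

```sql
-- Sketch: minutes worked per day for one ticket within February 2021.
-- The predicates compare the persisted column directly, so no string
-- parsing is needed at query time.
SELECT ticket_id, WorkingDate, SUM(WorkingDuration) AS minutes_worked
FROM ticket
WHERE ticket_id = 20
  AND WorkingDate >= '20210201'
  AND WorkingDate <  '20210301'
GROUP BY ticket_id, WorkingDate
HAVING SUM(WorkingDuration) > 0;
```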