Find data missing on a monthly basis

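Given the data in table `table` (see the schema below) and the well-known newID values (1, 2, 3) in table `ID`, I would like to find all the months that are missing some newID values. The result should only include months that have at least one newID.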

For the sample data, the expected result is: Dec-17 is missing newID 2, and Oct-17 is missing newID 1.

How can I accomplish this?

Table schema (fiddle): http://sqlfiddle.com/#!9/72af42

CREATE TABLE `ID`(
    Id int unsigned not null primary key
);
INSERT INTO `ID` (Id) VALUES (1),(2),(3);

CREATE TABLE `table`(
    recId int unsigned not null auto_increment primary key,
    Data DateTime not null,
    newID int not null
);

INSERT INTO `table` (`Data`,`newID`) VALUES
('2017-12-06',1),
('2017-12-06',3),
('2017-11-16',1),
('2017-11-16',2),
('2017-11-16',3),
('2017-10-05',2),
('2017-10-05',3),
('2017-10-03',2),
('2017-10-03',3),
('2017-08-16',1),
('2017-08-16',2),
('2017-08-16',3),
('2017-05-05',1),
('2017-05-05',2),
('2017-05-05',3);

Because you only need the records where a month exists but an ID is missing from it, this is a simple anti-join scenario: find the rows that have no match in another set.

In SQL an anti-join can be written either with a NOT EXISTS expression in the WHERE clause, or with a LEFT OUTER JOIN that keeps only the rows where the joined table's columns are NULL, since those are the rows for which no match was found.

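The query is complicated only by the fact that you want to compare on the month part of the date rather than on explicit date values. SQL provides all the necessary tools, and we can even format the output the way you want: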

SELECT DATE_FORMAT(CAST(CONCAT(m.year,'-',m.month,'-01') as DateTime), '%b-%y') as Data, i.Id as newId
FROM (
  SELECT YEAR(Data) AS Year, MONTH(Data) AS month
  FROM `table`
  GROUP BY YEAR(Data), MONTH(Data)
) m
CROSS JOIN `ID` i
LEFT OUTER JOIN `table` t ON YEAR(t.Data) = m.year AND MONTH(t.Data) = m.month AND t.newId = i.Id
WHERE t.Data IS NULL
ORDER BY m.Year DESC, m.month DESC

Have a look at this fiddle: http://sqlfiddle.com/#!9/72af42/2


Choosing between WHERE NOT EXISTS and LEFT OUTER JOIN can affect performance slightly, but the effect will depend on your query, your RDBMS and the available indexes. I personally reach for the JOIN syntax first because IMO it is simpler to maintain, but use your own discretion.

There is a lot of talk, at least in the MS SQL world, that NOT EXISTS should be faster than the JOIN because it evaluates fewer lookups. If performance is an issue for this specific query, however, you should look at storing the year and month as persisted column values, so that they can be indexed and the per-row function evaluations are avoided.

For comparison, this is the equivalent WHERE NOT EXISTS query: http://sqlfiddle.com/#!9/72af42/5

SELECT DATE_FORMAT(CAST(CONCAT(m.year,'-',m.month,'-01') as DateTime), '%b-%y') as Data, i.Id as newId
FROM (
  SELECT YEAR(Data) AS Year, MONTH(Data) AS month
  FROM `table`
  GROUP BY YEAR(Data), MONTH(Data)
) m
CROSS JOIN `ID` i
WHERE NOT EXISTS (
  SELECT * 
  FROM `table` t 
  WHERE YEAR(t.Data) = m.year 
    AND MONTH(t.Data) = m.month 
    AND t.newId = i.Id
)
ORDER BY m.Year DESC, m.month DESC
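
Rather than relying on general claims about which form is faster, the two queries can be compared on your own data by prefixing each with EXPLAIN, which asks MySQL to describe the execution plan it would use. A minimal sketch using the LEFT OUTER JOIN form (the output formatting is dropped here because only the plan matters):

```sql
-- EXPLAIN returns a description of the plan, not the data rows.
EXPLAIN
SELECT m.year, m.month, i.Id AS newId
FROM (
  SELECT YEAR(Data) AS year, MONTH(Data) AS month
  FROM `table`
  GROUP BY YEAR(Data), MONTH(Data)
) m
CROSS JOIN `ID` i
LEFT OUTER JOIN `table` t
  ON YEAR(t.Data) = m.year AND MONTH(t.Data) = m.month AND t.newID = i.Id
WHERE t.Data IS NULL;
```

Run the same statement against the NOT EXISTS form and compare the `type`, `key` and `rows` columns reported for `table`. The actual plans depend on your MySQL version, data volume and indexes, so treat any difference as specific to your environment.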

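How to optimise using persisted values?

If we pre-evaluate YEAR() and MONTH() and store the results directly in the table, the query will be faster, and we can then add indexes to supercharge it.

Consider the overall pros and cons before going this far...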

  • Do you really need this level of optimisation?
  • How often is the query going to be executed?
  • Can you change the application logic to use a more appropriate WHERE clause to restrict the scope of the data instead?

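Materialised Views

One solution is to create and manage a materialised view. This is a data warehousing (DW) technique that effectively lets you define a view that is executed periodically, with its results stored in its own table space.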

A materialised view does not optimise your query, but it allows you to execute a complex and long-running query once, so that the results can be queried directly like a normal table without having to re-evaluate column expressions.

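Your data and query look like a good fit for a materialised view: the query reads historical data whose rate of change is zero or very low, only new rows are added, and we probably don't care about results for the current month. In that case, if you end up running the query many times and the results stay more or less the same, why not run the query as a scheduled process, say once a month, and store the results in a purpose-built table? Your application can then query that table as often as it likes and get lightning-fast results.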

MySQL does not support materialised views, but you can replicate the concept in your application logic as explained above. Some other RDBMS provide this out of the box (OOTB); it's the concept that should be considered.
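
As a rough sketch of that idea in MySQL, the result of the query can be stored in a purpose-built table and rebuilt on a schedule. The table name `missing_ids_by_month` and its columns are assumptions made for illustration and are not part of the original schema; how the refresh is scheduled (cron job, application task, MySQL event) is left open.

```sql
-- Hypothetical summary table; name and columns are illustrative only.
CREATE TABLE `missing_ids_by_month` (
  `year`  int NOT NULL,
  `month` int NOT NULL,
  `newID` int unsigned NOT NULL,
  PRIMARY KEY (`year`, `month`, `newID`)
);

-- Refresh step: rebuild the stored result from the base tables.
-- Run this periodically, for example once a month.
DELETE FROM `missing_ids_by_month`;

INSERT INTO `missing_ids_by_month` (`year`, `month`, `newID`)
SELECT m.`year`, m.`month`, i.Id
FROM (
  SELECT YEAR(Data) AS `year`, MONTH(Data) AS `month`
  FROM `table`
  GROUP BY YEAR(Data), MONTH(Data)
) m
CROSS JOIN `ID` i
LEFT OUTER JOIN `table` t
  ON YEAR(t.Data) = m.`year` AND MONTH(t.Data) = m.`month` AND t.newID = i.Id
WHERE t.recId IS NULL;

-- The application then reads the stored rows directly.
SELECT `year`, `month`, `newID`
FROM `missing_ids_by_month`
ORDER BY `year` DESC, `month` DESC, `newID`;
```

With the sample data from the question this stores two rows: (2017, 12, 2) and (2017, 10, 1). The expensive YEAR()/MONTH() comparison then runs only during the refresh, not on every read.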

Computed Columns

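You could add extra columns to your table and maintain them through user/application logic, but this is not very reliable unless you trust your application developers and the application is the only process that will update this table.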

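Computed columns are a good fit for reliability in this scenario, but they only help performance if the values can be persisted in the table. (By default a computed column's expression is evaluated at execution time, which does little for the current query.)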

Again, this is where MySQL will let you down; many other RDBMS offer simpler ways to do this. You need MySQL v5.7 or later for this to work:

-- MySQL requires a data type before GENERATED ALWAYS
ALTER TABLE `table` ADD `year` int GENERATED ALWAYS AS (YEAR(`Data`)) STORED;
ALTER TABLE `table` ADD `month` int GENERATED ALWAYS AS (MONTH(`Data`)) STORED;

Triggers

Your other option is to add the columns and then use triggers to maintain the values. MySQL does not make this easy, but it works:

  1. Add the columns to your table:

    ALTER TABLE `table` ADD (`year` int NULL);
    ALTER TABLE `table` ADD (`month` int NULL);
    
  2. Create triggers to manage the values in these columns so that users cannot override them:

    DELIMITER $$
    
    CREATE TRIGGER persist_index_values_insert
      BEFORE INSERT ON `table` FOR EACH ROW
    BEGIN
      SET NEW.year = YEAR(NEW.Data);
      SET new.month = MONTH(NEW.Data);
    END$$
    
    CREATE TRIGGER persist_index_values_update
    BEFORE UPDATE ON `table` FOR EACH ROW
    BEGIN
      SET NEW.year = YEAR(NEW.Data);
      SET NEW.month = MONTH(NEW.Data);
    END$$
    

    DELIMITER ;

  3. The query becomes simpler:

     SELECT DATE_FORMAT(CAST(CONCAT(m.year,'-',m.month,'-01') as DateTime), '%b-%y') as Data, i.Id as newId
     FROM (
       SELECT `year`, `month`
       FROM `table`
       GROUP BY `year`, `month`
     ) m
     CROSS JOIN `ID` i
     LEFT OUTER JOIN `table` t ON t.year = m.year AND t.month = m.month AND t.newId = i.Id
     WHERE t.Data IS NULL
     ORDER BY m.Year DESC, m.month DESC
    
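  4. Indexes can now be applied as needed. You should consult the query execution plan for guidance, but I suggest you will need at least an index on year, month and newID: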

      CREATE INDEX IX_TABLE_YEAR_MONTH_NEWID ON `table` (`year`,`month`,`newID`);
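
One thing the steps above do not cover: rows that already existed before the columns and triggers were added will still have NULL in `year` and `month`, because the triggers only fire on later inserts and updates. A minimal sketch of a one-off backfill, assuming the trigger-based setup from steps 1 and 2, followed by a plan check to see whether the index is picked up:

```sql
-- One-off backfill for rows inserted before the triggers existed.
UPDATE `table`
SET `year` = YEAR(`Data`), `month` = MONTH(`Data`)
WHERE `year` IS NULL OR `month` IS NULL;

-- Check whether the optimizer uses the index for the lookup on `table` t.
EXPLAIN
SELECT m.`year`, m.`month`, i.Id AS newId
FROM (
  SELECT `year`, `month`
  FROM `table`
  GROUP BY `year`, `month`
) m
CROSS JOIN `ID` i
LEFT OUTER JOIN `table` t
  ON t.`year` = m.`year` AND t.`month` = m.`month` AND t.newID = i.Id
WHERE t.Data IS NULL;
```

The backfill is not needed with the generated-column approach, since STORED generated columns are computed for existing rows when the column is added.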