Get the cars that passed specific cameras

Question

MySQL/MariaDB schema and sample data:

CREATE DATABASE IF NOT EXISTS `puzzle` DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_unicode_ci;

USE `puzzle`;

DROP TABLE IF EXISTS `event`;

CREATE TABLE `event` (
  `eventId` bigint(20) NOT NULL AUTO_INCREMENT,
  `sourceId` bigint(20) NOT NULL COMMENT 'think of source as camera',
  `carNumber` varchar(40) NOT NULL COMMENT 'ex: 5849',
  `createdOn` datetime DEFAULT NULL,
  PRIMARY KEY (`eventId`)
) ENGINE=INNODB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;


INSERT INTO `event` (`eventId`, `sourceId`, `carNumber`, `createdOn`) VALUES
    (1, 44, '4456', '2016-09-20 20:24:05'),
    (2, 26, '26484', '2016-09-20 20:24:05'),
    (3, 5, '4456', '2016-09-20 20:24:06'),
    (4, 3, '72704', '2016-09-20 20:24:15'),
    (5, 3, '399606', '2016-09-20 20:26:15'),
    (6, 5, '4456', '2016-09-20 20:27:25'),
    (7, 44, '72704', '2016-09-20 20:29:25'),
    (8, 3, '4456', '2016-09-20 20:30:55'),
    (9, 44, '26484', '2016-09-20 20:34:55'),
    (10, 26, '4456', '2016-09-20 20:35:15'),
    (11, 3, '72704', '2016-09-20 20:35:15'),
    (12, 3, '399606', '2016-09-20 20:44:35'),
    (13, 26, '4456', '2016-09-20 20:49:45');

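I want to get the carNumber(s) that passed sourceId = 3 AND (26 OR 44) between 20:24 and 20:45. The query needs to be fast, because the real table contains more than 300 million records.

So far, the query below is the furthest I have been able to get (and it does not even produce valid results):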

select * from event e where 
e.createdOn > '2016-09-20 20:24:00' and e.createdOn < '2016-09-20 20:45:00' 
and e.sourceId IN(3,26,44) group by e.carNumber;

Correct result for the provided data:

carNumber
4456
72704

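This is a real brain twister and I am stuck. I tried EXISTS, joins, and subqueries without success, so I am wondering whether SQL can solve this at all, or whether I should do it in backend code instead.

MySQL/MariaDB versions in use: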

mariadb-5.5.50

mysql-5.5.51

Answer 1

Something like the following should do the trick:

 SELECT carNumber
 FROM event
 WHERE sourceID = 3
     AND carNumber IN (SELECT carNumber FROM event WHERE sourceID IN(26,44))
 GROUP BY carNumber

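The WHERE clause finds the records where sourceId is 3, and then also makes sure that the carNumber has at least one other record in the table where sourceId is 26 or 44.

Don't write any code outside of SQL for this, because this is exactly the kind of problem SQL was designed to solve quickly.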
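
Note that the query as written does not restrict either side to the 20:24 to 20:45 window from the question. The following is a minimal sketch of one way to add it, written as a self-join rather than IN (subquery), because older MySQL versions such as 5.5 often run an IN subquery once per outer row. The aliases a and b are illustrative only, and the boundaries are copied from the question:

```sql
-- Sketch: cars seen by camera 3 AND by camera 26 or 44, both within the window.
SELECT DISTINCT a.carNumber
FROM event AS a
JOIN event AS b
  ON b.carNumber = a.carNumber          -- same car on both sides
WHERE a.sourceId = 3
  AND a.createdOn > '2016-09-20 20:24:00'
  AND a.createdOn < '2016-09-20 20:45:00'
  AND b.sourceId IN (26, 44)
  AND b.createdOn > '2016-09-20 20:24:00'
  AND b.createdOn < '2016-09-20 20:45:00';
```

On the sample data this returns 4456 and 72704. Whether it beats the subquery form on the real table depends on the available indexes, so it is worth comparing both with EXPLAIN.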

Answer 2

You can use a HAVING clause to filter the groups. Use sum() over a condition to count how many rows in a group satisfy it.

select e.carNumber 
from event e 
where e.createdOn > '2016-09-20 20:24:00' 
  and e.createdOn < '2016-09-20 20:45:00'
group by e.carNumber
having sum(e.sourceId = 3) > 0
   and sum(e.sourceId IN (26,44)) > 0
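
MySQL and MariaDB evaluate a comparison such as e.sourceId = 3 to 1 or 0, which is why summing it counts the matching rows. As a sketch for comparison, the same counts can be spelled out with CASE expressions, which is the more portable form if the query ever has to run on another database:

```sql
-- Same logic as above, with the implicit boolean-to-integer step made explicit.
SELECT e.carNumber
FROM event e
WHERE e.createdOn > '2016-09-20 20:24:00'
  AND e.createdOn < '2016-09-20 20:45:00'
GROUP BY e.carNumber
HAVING SUM(CASE WHEN e.sourceId = 3 THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN e.sourceId IN (26, 44) THEN 1 ELSE 0 END) > 0;
```

On the sample data both forms return 4456 and 72704.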

Answer 3

If you need it to be faster, then the following might work, assuming you have an index on event(createdOn, carNumber, sourceId):

select e.carNumber 
from event e 
where e.createdOn > '2016-09-20 20:24:00' and e.createdOn < '2016-09-20 20:45:00'
group by e.carNumber
having sum(e.sourceId = 3) > 0 and
       sum(e.sourceId IN (26, 44)) > 0;

I would be inclined to change this to:

select e.carNumber 
from event e 
where e.createdOn > '2016-09-20 20:24:00' and e.createdOn < '2016-09-20 20:45:00' and
      e.sourceId in (3, 26, 44)
group by e.carNumber
having sum(e.sourceId = 3) > 0 and
       sum(e.sourceId IN (26, 44)) > 0;

And then, for performance, even to this:

select carNumber
from ((select carNumber, sourceId
       from event e
       where e.sourceId = 3 and
             e.createdOn > '2016-09-20 20:24:00' and e.createdOn < '2016-09-20 20:45:00'
      ) union all
      (select carNumber, sourceId
       from event e
       where e.sourceId = 26 and
             e.createdOn > '2016-09-20 20:24:00' and e.createdOn < '2016-09-20 20:45:00'
      ) union all
      (select carNumber, sourceId
       from event e
       where e.sourceId = 44 and
             e.createdOn > '2016-09-20 20:24:00' and e.createdOn < '2016-09-20 20:45:00'
      )
     ) e
group by e.carNumber
having sum(e.sourceId = 3) > 0 and
       sum(e.sourceId IN (26, 44)) > 0;

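This version can take advantage of an index on event(sourceId, createdOn, carNumber). Each subquery should use this index very efficiently, bringing together only a small amount of data for the final aggregation.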

Answer 4

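Shrink the table size

For 300M rows, you really should use the smallest practical datatypes.

BIGINT takes 8 bytes; INT UNSIGNED (only 4 bytes) is usually sufficient (up to about 4 billion). If there are fewer than 65K cameras, use a 2-byte SMALLINT UNSIGNED.
carNumber looks like a number, so why use VARCHAR? Your examples take 5-7 bytes in VARCHAR, 4 bytes in INT UNSIGNED, and 3 bytes in MEDIUMINT UNSIGNED (max 16M).

Shrinking the table will help whichever solution you choose.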
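
A rough sketch of what that could look like for the table in the question. The chosen types rest on assumptions that must be checked against the real data: every carNumber is purely numeric with no meaningful leading zeros and stays below 16,777,216, there are fewer than 65,536 cameras, and eventId stays within the INT UNSIGNED range:

```sql
-- Assumption-laden sketch: verify the value ranges before running this.
-- The statement rebuilds the table, which takes a long time on 300M rows,
-- so try it on a copy first.
ALTER TABLE `event`
  MODIFY `eventId`   INT UNSIGNED       NOT NULL AUTO_INCREMENT,
  MODIFY `sourceId`  SMALLINT UNSIGNED  NOT NULL COMMENT 'think of source as camera',
  MODIFY `carNumber` MEDIUMINT UNSIGNED NOT NULL COMMENT 'ex: 5849';
```

If any plate contains letters, carNumber has to stay a string type, and only the two id columns can be shrunk this way.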

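Covering index

This has been suggested in other answers, but I want to make clear why it helps. If all the columns a query uses exist in a single index, the query can be performed entirely in the index's BTree without touching the data. Because the index is smaller, this is usually faster. A 'covering' index for this query contains sourceId, carNumber, and createdOn, in any order.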

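Order of columns in the index

Since an index can only be used from left to right, the order matters. (This does not apply to Gordon's first select, the first query in Answer 3, which needs createdOn first.)

sourceId is tested with = or IN, so it should come first. In the case of IN, you may need 5.6 or later to get the IN optimization.
createdOn is tested as a range, so the index lookup stops there; columns after it cannot be used for further filtering.
For "covering", any extra columns can now be added. In this case, carNumber.

So, most (not all) of the suggestions need this order: INDEX(sourceId, createdOn, carNumber).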
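
As a sketch, the index could be added and checked like this. The index name idx_source_created_car is made up for illustration, and the EXPLAIN output depends on the server version and table statistics:

```sql
-- Hypothetical index name; any unused name works.
ALTER TABLE `event`
  ADD INDEX `idx_source_created_car` (`sourceId`, `createdOn`, `carNumber`);

-- Check a single-camera lookup. When the index covers the query,
-- the Extra column of the EXPLAIN output should include "Using index".
EXPLAIN
SELECT carNumber
FROM `event`
WHERE sourceId = 3
  AND createdOn >= '2016-09-20 20:24:00'
  AND createdOn <  '2016-09-20 20:45:00';
```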

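Get rid of auto_increment

Do you use eventId in other tables? If so, you should probably keep it. If not, is the combination (sourceId, createdOn, carNumber) unique? If so, make that the PRIMARY KEY. A surrogate PK is fine in some cases, but it hinders performance in others. I suspect it may be a hindrance here.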
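
A hedged sketch of that change, valid only under the conditions just listed: nothing else references eventId, the three columns together never repeat, and createdOn contains no NULLs (it is nullable in the original schema, and primary key columns cannot be NULL):

```sql
-- Sketch only: the ALTER fails if (sourceId, createdOn, carNumber) has duplicates.
-- It also rebuilds the whole table, so test on a copy first.
ALTER TABLE `event`
  DROP COLUMN `eventId`,
  MODIFY `createdOn` DATETIME NOT NULL,
  ADD PRIMARY KEY (`sourceId`, `createdOn`, `carNumber`);
```

With InnoDB the primary key is the clustered index, so after this change the rows themselves are stored in (sourceId, createdOn, carNumber) order and a separate secondary index on those columns is no longer needed.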

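Avoid slow operations

UNION generally involves a temporary table, which adds overhead. While UNION is good for making better use of indexes and for avoiding OR, the overhead of the tmp table may outweigh the benefit for what looks like a small result set.

Gordon was right to use UNION ALL instead of the default UNION DISTINCT; the latter requires a de-dup pass, which is unnecessary for his query.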

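Bottom line

Shrink the table.
Change the PK, if possible; if not, add the suggested index.
Upgrade to at least 5.6.
Use Gordon's second query (the second one in Answer 3).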

Another solution

(I don't know whether this is any better, but it may be worth a try.)

SELECT carNumber 
    FROM ( SELECT DISTINCT carNumber
           FROM event
           WHERE sourceId = 3
             AND createdOn >= '2016-09-20 20:24:00'
             AND createdOn  < '2016-09-20 20:45:00'
         ) AS x
    WHERE EXISTS ( SELECT * FROM event
            WHERE carNumber = x.carNumber
              AND sourceId IN (26,44)
              AND createdOn >= '2016-09-20 20:24:00'
              AND createdOn  < '2016-09-20 20:45:00'
                 );

Two indexes are needed:

(sourceId, createdOn, carNumber)  -- as before
(carNumber, sourceId, createdOn)  -- to optimize the EXISTS
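
The first of these is the index already recommended above. A minimal sketch for adding the second one, with a made-up index name:

```sql
-- Hypothetical index name. Leading with carNumber lets the EXISTS probe
-- go straight to one car and then narrow by camera and time.
ALTER TABLE `event`
  ADD INDEX `idx_car_source_created` (`carNumber`, `sourceId`, `createdOn`);
```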
