Real-time aggregation on a table with millions of records

Question

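I am working with a steadily growing table that currently holds about 5 million records. Roughly 100,000 new records are added each day.

The table holds information about ad campaigns and is joined with another table at query time: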

CREATE TABLE `statistics` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `ip_range_id` int(11) DEFAULT NULL,
    `campaign_id` int(11) DEFAULT NULL,
    `payout` decimal(5,2) DEFAULT NULL,
    `is_converted` tinyint(1) unsigned NOT NULL DEFAULT '0',
    `converted` datetime DEFAULT NULL,
    `created` datetime DEFAULT NULL,
    PRIMARY KEY (`id`),
    KEY `created` (`created`),
    KEY `converted` (`converted`),
    KEY `campaign_id` (`campaign_id`),
    KEY `ip_range_id` (`ip_range_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The other table contains the IP ranges:

CREATE TABLE `ip_ranges` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `ip_range` varchar(11) NOT NULL,
    PRIMARY KEY (`id`),
    KEY `ip_range` (`ip_range`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The aggregation query looks like this:

SELECT
    SUM(`payout`) AS `revenue`, 
    (SELECT COUNT(*) FROM `statistics` WHERE `ip_range_id` = `IpRange`.`id`) AS `clicks`, 
    (SELECT COUNT(*) FROM `statistics` WHERE `ip_range_id` = `IpRange`.`id` AND `is_converted` = 1) AS `conversions` 
FROM `ip_ranges` AS `IpRange` 
INNER JOIN `statistics` AS `Statistic` ON `IpRange`.`id` = `Statistic`.`ip_range_id`
GROUP BY `IpRange`.`id` 
ORDER BY `clicks` DESC 
LIMIT 20

The query takes about 20 seconds to complete.

This is what EXPLAIN returns:

id  select_type         table       type   possible_keys    key          key_len  ref               rows    Extra
1   PRIMARY             ip_range    index  PRIMARY          PRIMARY      4        NULL              306552  Using index; Using temporary; Using filesort
1   PRIMARY             statistic   ref    ip_range_id      ip_range_id  5        db.ip_range.id    8       Using where
3   DEPENDENT SUBQUERY  statistics  ref    ip_range_id      ip_range_id  5        func              8       Using where
2   DEPENDENT SUBQUERY  statistics  ref    ip_range_id      ip_range_id  5        func              8       Using where; Using index

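Caching the clicks and conversions as extra columns in the ip_ranges table is not an option, because I also need to be able to filter on the campaign_id column (and possibly on other columns in the future). So these aggregations need to be more or less real-time.

What is the best strategy for aggregating a large table across multiple dimensions in near real-time?

Note that I am not necessarily looking only for a better query; I am also interested in strategies that might involve other database systems (NoSQL) and/or distributing the data across different servers, etc.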

Answer 1

Try this:

SELECT
    SUM(`payout`) AS `revenue`, 
    SUM(case when `ip_range_id` = `IpRange`.`id` then 1 else 0 end) AS `clicks`, 
    SUM(case when `ip_range_id` = `IpRange`.`id` and `is_converted` = 1 then 1 else 0 end)  
      AS `conversions` 
FROM `ip_ranges` AS `IpRange` 
INNER JOIN `statistics` AS `Statistic` ON `IpRange`.`id` = `Statistic`.`ip_range_id`
GROUP BY `IpRange`.`id` 
ORDER BY `clicks` DESC 
LIMIT 20

Answer 2

Your query looks overly complicated. There is no need to query the same table again and again:

select
  sum(payout) as revenue, 
  count(*) as clicks, 
  sum(s.is_converted = 1) as conversions 
from ip_ranges r
inner join statistics s on r.id = s.ip_range_id
group by r.id 
order by clicks desc 
limit 20;

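Edit (after acceptance): on the actual question of how to approach a task like this:

You want to look at all the data in your table, and you want the result to be up to date. In that case there is no alternative to reading all the data (a full table scan). If the table is wide (i.e. has many columns), you may want to create a covering index (i.e. an index that contains all the columns involved), so that the index is read instead of the table. What else? For full table scans, parallel access is advisable, which as far as I know MySQL does not offer. So you may want to switch to another DBMS, and then see what else that DBMS has to offer. Perhaps parallel querying would benefit from partitioning the table. The last thing that comes to mind is hardware, i.e. more CPUs, faster drives, etc.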
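As a rough sketch of the covering-index idea for the query discussed here: the index name and the column order below are assumptions, not something from the question, and the query drops the join to ip_ranges on the assumption that every non-NULL ip_range_id refers to an existing row there. If the index works as intended, the Extra column of EXPLAIN should mention "Using index" for statistics.

```sql
-- Hypothetical index name; the column order is an assumption.
-- ip_range_id comes first so the GROUP BY can follow the index order;
-- is_converted and payout are included so no table rows need to be read.
ALTER TABLE statistics
    ADD INDEX idx_stats_range_cover (ip_range_id, is_converted, payout);

-- All referenced columns are in the index, so the aggregation
-- can be answered from the index alone.
EXPLAIN
SELECT ip_range_id,
       SUM(payout)           AS revenue,
       COUNT(*)              AS clicks,
       SUM(is_converted = 1) AS conversions
FROM statistics
WHERE ip_range_id IS NOT NULL
GROUP BY ip_range_id
ORDER BY clicks DESC
LIMIT 20;
```

For queries that filter on campaign_id, a second index with campaign_id as the leading column (followed by the same three columns) would probably be needed; each extra index also adds some cost to the roughly 100,000 daily inserts.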

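Another option might be to remove old data from your table. Say you need the details for the current year, but only the aggregated totals for previous years. Then have another table, old_statistics, that holds just the sums and counts you need, e.g.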

table old_statistics
(
  ip_range_id,
  revenue,
  conversions
);

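You would then aggregate the data from statistics, which would be much smaller because it contains only the current year's data, and add the values from old_statistics to get the result.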
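A minimal sketch of how this could look in MySQL. The column types, the extra clicks column (needed for the clicks ordering in the question), and the cutoff date '2024-01-01' are all assumptions made for illustration. To keep filtering by campaign_id possible, the summary table would also need a campaign_id column in its key.

```sql
-- Summary table; types and the clicks column are assumptions.
CREATE TABLE old_statistics (
    ip_range_id int(11)          NOT NULL,
    revenue     decimal(12,2)    NOT NULL DEFAULT 0,
    clicks      int(10) unsigned NOT NULL DEFAULT 0,
    conversions int(10) unsigned NOT NULL DEFAULT 0,
    PRIMARY KEY (ip_range_id)
) ENGINE=InnoDB;

-- One-time archival: summarize rows older than the (hypothetical) cutoff,
-- then remove them from the detail table.
INSERT INTO old_statistics (ip_range_id, revenue, clicks, conversions)
SELECT ip_range_id,
       COALESCE(SUM(payout), 0),
       COUNT(*),
       SUM(is_converted = 1)
FROM statistics
WHERE created < '2024-01-01'
  AND ip_range_id IS NOT NULL
GROUP BY ip_range_id;

DELETE FROM statistics WHERE created < '2024-01-01';

-- Report: aggregate the current detail rows and add the archived totals.
SELECT ip_range_id,
       SUM(revenue)     AS total_revenue,
       SUM(clicks)      AS total_clicks,
       SUM(conversions) AS total_conversions
FROM (
    SELECT ip_range_id,
           COALESCE(SUM(payout), 0) AS revenue,
           COUNT(*)                 AS clicks,
           SUM(is_converted = 1)    AS conversions
    FROM statistics
    WHERE ip_range_id IS NOT NULL
    GROUP BY ip_range_id
    UNION ALL
    SELECT ip_range_id, revenue, clicks, conversions
    FROM old_statistics
) AS combined
GROUP BY ip_range_id
ORDER BY total_clicks DESC
LIMIT 20;
```

The INSERT and DELETE would best run inside one transaction so that no rows are counted twice or lost if the archival is interrupted.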

mysql

sql

aggregation