5 seconds per select from 80M records in MemSQL
We are using MemSQL 5.1 for a web analytics project. It ingests about 80 million records per day, plus about another 0.5 million. A simple query, how much traffic each domain, geo, and language received on a given day, takes about 5 seconds. I think this time can be reduced, but I can't find a way. Please point me in the right direction.
The table looks like this:
CREATE TABLE `domains` (
`date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`geo` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`lang` char(5) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`browser` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`os` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`device` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`domain` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`ref` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`blk_cnt` int(11) DEFAULT NULL,
KEY `date` (`date`,`geo`,`lang`,`domain`) /*!90619 USING CLUSTERED COLUMNSTORE */
/*!90618 , SHARD KEY () */
)
And a query like this:
memsql> explain SELECT domain, geo, lang, avg(blk_cnt) as blk_cnt, count(*) as cnt FROM domains WHERE date BETWEEN '2016-07-31 0:00' AND '2016-08-01 0:00' GROUP BY domain, geo, lang ORDER BY blk_cnt ASC limit 40;
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| Project [r0.domain, r0.geo, r0.lang, $0 / CAST(COALESCE($1,0) AS SIGNED) AS blk_cnt, CAST(COALESCE($2,0) AS SIGNED) AS cnt] |
| Top limit:40 |
| GatherMerge [SUM(r0.s) / CAST(COALESCE(SUM(r0.c),0) AS SIGNED)] partitions:all est_rows:40 |
| Project [r0.domain, r0.geo, r0.lang, s / CAST(COALESCE(c,0) AS SIGNED) AS blk_cnt, CAST(COALESCE(cnt_1,0) AS SIGNED) AS cnt, s, c, cnt_1] est_rows:40 |
| TopSort limit:40 [SUM(r0.s) / CAST(COALESCE(SUM(r0.c),0) AS SIGNED)] |
| HashGroupBy [SUM(r0.s) AS s, SUM(r0.c) AS c, SUM(r0.cnt) AS cnt_1] groups:[r0.domain, r0.geo, r0.lang] |
| TableScan r0 storage:list stream:no |
| Repartition [domains.domain, domains.geo, domains.lang, cnt, s, c] AS r0 shard_key:[domain, geo, lang] est_rows:40 est_select_cost:144350216 |
| HashGroupBy [COUNT(*) AS cnt, SUM(domains.blk_cnt) AS s, COUNT(domains.blk_cnt) AS c] groups:[domains.domain, domains.geo, domains.lang] |
| Filter [domains.date >= '2016-07-31 0:00' AND domains.date <= '2016-08-01 0:00'] |
| ColumnStoreScan scan_js_data.domains, KEY date (date, geo, lang, domain) USING CLUSTERED COLUMNSTORE est_table_rows:72175108 est_filtered:18043777 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
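One small point about the original query, independent of the plan: BETWEEN is inclusive on both ends, so a row stamped exactly '2016-08-01 00:00:00' is counted both in this window and in the next day's. A half-open range, sketched below on the same query, avoids the double count:

SELECT domain, geo, lang, AVG(blk_cnt) AS blk_cnt, COUNT(*) AS cnt
FROM domains
WHERE date >= '2016-07-31 00:00:00'
  AND date <  '2016-08-01 00:00:00'  -- half-open: midnight rows belong to the next day only
GROUP BY domain, geo, lang
ORDER BY blk_cnt ASC
LIMIT 40;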
After applying the suggestions:
- Original query time - 5s
- With the timestamp optimization - 3.7s
- With timestamp + shard key - 2.6s
Thank you very much!
Group-by execution is likely the most expensive part of this query. Using a shard key that matches the GROUP BY, i.e. SHARD KEY (domain, geo, lang), will allow the group-by to run faster.
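A minimal sketch of the rebuilt table with that matching shard key; every column is copied unchanged from the question, and only the SHARD KEY clause differs:

CREATE TABLE `domains` (
  `date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `geo` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `lang` char(5) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `browser` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `os` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `device` varchar(20) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `domain` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `ref` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  `blk_cnt` int(11) DEFAULT NULL,
  KEY `date` (`date`,`geo`,`lang`,`domain`) USING CLUSTERED COLUMNSTORE,
  -- Rows sharing (domain, geo, lang) now land on the same partition,
  -- so each partition can finish its GROUP BY locally.
  SHARD KEY (`domain`,`geo`,`lang`)
)

With the shard key matching the GROUP BY columns, the Repartition step in the plan above (and the second HashGroupBy that re-aggregates the shuffled rows) should drop out, which lines up with the 2.6s timing reported above.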