SQL average time between first and second row in a set
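Ouch! My ears hurt.. I thought I had it a few times, but epic fail :(
I have the following data: millions of rows, indexed, MySQL 5.6.
The table holds several sets of data, and uuid is basically the unique id of each set.
I need the average time between the first and second rows of each set. In other words: how much time passed between the first insert (which created the set) and the second insert for the same set, and then the average of those results.
I can get an average no problem; what I can't figure out is how to get the time difference between the first and second row in each set.
I won't even embarrass myself by pasting my broken SQL with my failed attempts using sub-queries and LIMIT. I'll just say that this one escapes me.
Any help is appreciated, have a beer on me :/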
+------+-----------------------------------------+----------------------------+
| id   | uuid                                    | stamp                      |
+------+-----------------------------------------+----------------------------+
|  707 | 60b5-d062-5829-c11d-5b71-5d85-075b-a3c5 | 2020-01-01 17:00:28.000000 |
|  708 | 60b5-d062-5829-c11d-5b71-5d85-075b-a3c5 | 2020-01-01 17:01:30.000000 |
|  709 | 0ccf-94e0-ce72-8092-1975-5bea-6131-c719 | 2020-01-02 14:11:48.000000 |
|  710 | 59c8-60ee-d172-511a-a477-c637-6789-f14a | 2020-01-02 14:23:36.000000 |
|  711 | b33b-7584-1fed-e138-28ba-c24a-9b46-88e7 | 2020-01-02 14:24:07.000000 |
|  712 | eddc-b12a-5ef2-baea-cf53-7287-5805-d922 | 2020-01-02 14:24:26.000000 |
|  713 | 257b-fc66-6d7a-ba21-727e-1da7-0ee1-714c | 2020-01-02 14:25:31.000000 |
|  718 | c5d9-acba-9a12-aacb-cf45-c5a9-2b8d-314c | 2020-01-02 15:46:41.000000 |
|  719 | 0ccf-94e0-ce72-8092-1975-5bea-6131-c719 | 2020-01-02 15:55:42.000000 |
|  720 | c5d9-acba-9a12-aacb-cf45-c5a9-2b8d-314c | 2020-01-02 15:56:33.000000 |
|  722 | c5d9-acba-9a12-aacb-cf45-c5a9-2b8d-314c | 2020-01-02 16:16:14.000000 |
|  723 | c5d9-acba-9a12-aacb-cf45-c5a9-2b8d-314c | 2020-01-02 16:21:25.000000 |
|  726 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 17:16:33.000000 |
|  727 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 17:21:20.000000 |
|  728 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 17:45:07.000000 |
|  729 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 17:50:17.000000 |
|  730 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 18:14:02.000000 |
|  731 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 18:27:48.000000 |
|  732 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 18:28:57.000000 |
|  733 | c193-a46f-1104-3ee3-7387-94a8-ef32-a85e | 2020-01-02 18:40:40.000000 |
|  734 | c193-a46f-1104-3ee3-7387-94a8-ef32-a85e | 2020-01-02 18:40:49.000000 |
+------+-----------------------------------------+----------------------------+
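This would be trivial if each uuid appeared only twice. You have millions of rows, so let's try to avoid a sort and assume you have the right indexes.
Here is one way to get the two earliest rows of each uuid: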
select t.*
from t
where t.stamp <= (select t2.stamp
                  from t t2
                  where t2.uuid = t.uuid
                  order by t2.stamp asc
                  limit 1,1
                 );
Very important: you need an index on (uuid, stamp) to have any hope of decent performance.
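For reference, such an index could be created as below. This is only a minimal sketch: it assumes the table is named t as in the query above, and the index name ix_uuid_stamp is simply borrowed from the EXPLAIN output later in this thread (any name works).

```sql
-- Composite index: rows of one uuid are stored together, ordered by stamp,
-- so "second-earliest stamp for this uuid" can be read straight off the index.
-- Table name t and index name ix_uuid_stamp are assumptions for illustration.
CREATE INDEX ix_uuid_stamp ON t (uuid, stamp);
```

The column order matters here: uuid comes first because the subquery filters on it with equality, and stamp second because it is used for the ordering within each uuid.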
Then, just aggregate:
select uuid, timestampdiff(second, min(stamp), max(stamp))
from (select t.*
      from t
      where t.stamp <= (select t2.stamp
                        from t t2
                        where t2.uuid = t.uuid
                        order by t2.stamp asc
                        limit 1,1
                       )
     ) t
group by uuid;
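The query above returns one difference per uuid, while the question asks for the average of those differences. One way to finish the job, sketched here rather than taken from the original answer, is to wrap it in an outer AVG. The aliases diff_seconds, avg_diff_seconds and per_uuid are made up for illustration.

```sql
-- Average, over all uuids, of the gap between their first and second row.
SELECT AVG(diff_seconds) AS avg_diff_seconds
FROM (
    -- One row per uuid: seconds between its earliest and second-earliest stamp.
    SELECT t.uuid,
           TIMESTAMPDIFF(SECOND, MIN(t.stamp), MAX(t.stamp)) AS diff_seconds
    FROM t
    WHERE t.stamp <= (SELECT t2.stamp
                      FROM t t2
                      WHERE t2.uuid = t.uuid
                      ORDER BY t2.stamp ASC
                      LIMIT 1, 1)   -- skip 1 row, take 1: the second-earliest stamp
    GROUP BY t.uuid
) AS per_uuid;
```

A uuid with a single row drops out on its own: its subquery returns NULL, the comparison is never true, and so it does not drag the average toward zero. On the sample rows in the question, the five remaining groups have gaps of 6234, 62, 287, 9 and 592 seconds, which average to 1436.8.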
Another approach is to use LEFT JOINs instead of a subquery.
SELECT
    t.uuid
    , t.stamp AS t_stamp
    , t_next.stamp AS t_next_stamp
    , TIME_TO_SEC(TIMEDIFF(t_next.stamp, t.stamp)) AS diff
FROM
    ttt AS t
    LEFT JOIN ttt AS t_prev ON (
        t_prev.uuid = t.uuid
        AND t_prev.stamp < t.stamp
    )
    INNER JOIN ttt AS t_next ON (
        t_next.uuid = t.uuid
        AND t_next.stamp > t.stamp
    )
    LEFT JOIN ttt AS t_before_next ON (
        t_before_next.uuid = t.uuid
        AND t_before_next.stamp > t.stamp
        AND t_before_next.stamp < t_next.stamp
    )
WHERE
    t_prev.id IS NULL -- no t_prev so t is the first record
    AND t_before_next.id IS NULL -- no t_before_next so t_next is the second record
    -- filter data by your criteria, per day for example.
    -- you will need to "duplicate" filtering conditions for t_prev and t_next
ORDER BY
    uuid
=>
uuid                                     t_stamp              t_next_stamp         diff
0ccf-94e0-ce72-8092-1975-5bea-6131-c719  2020-01-02 14:11:48  2020-01-02 15:55:42  6234
60b5-d062-5829-c11d-5b71-5d85-075b-a3c5  2020-01-01 17:00:28  2020-01-01 17:01:30  62
6610-a9df-358d-0065-beb8-cea1-82a6-3258  2020-01-02 17:16:33  2020-01-02 17:21:20  287
c193-a46f-1104-3ee3-7387-94a8-ef32-a85e  2020-01-02 18:40:40  2020-01-02 18:40:49  9
c5d9-acba-9a12-aacb-cf45-c5a9-2b8d-314c  2020-01-02 15:46:41  2020-01-02 15:56:33  592
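Warning:
The query above will miss records that share the same stamp. If you need them, you have to change the join conditions
from
    t_prev.stamp < t.stamp
to
    t_prev.stamp <= t.stamp AND t_prev.id < t.id
and likewise for the other joins.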
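To make the "and likewise" concrete, here is one possible tie-safe variant. It is a sketch under two assumptions that the question does not confirm: id is unique, and breaking ties on id is acceptable. It also uses a slightly different condition from the one quoted above: rows are ordered by the pair (stamp, id), so a row with an equal stamp counts as "earlier" only if its id is smaller.

```sql
-- "First" and "second" are defined by (stamp, id) order, so equal stamps
-- no longer hide rows. Assumes id is unique (not stated in the question).
SELECT
    t.uuid
    , TIME_TO_SEC(TIMEDIFF(t_next.stamp, t.stamp)) AS diff
FROM
    ttt AS t
    LEFT JOIN ttt AS t_prev ON (
        t_prev.uuid = t.uuid
        -- t_prev sorts strictly before t
        AND (t_prev.stamp < t.stamp
             OR (t_prev.stamp = t.stamp AND t_prev.id < t.id))
    )
    INNER JOIN ttt AS t_next ON (
        t_next.uuid = t.uuid
        -- t_next sorts strictly after t
        AND (t_next.stamp > t.stamp
             OR (t_next.stamp = t.stamp AND t_next.id > t.id))
    )
    LEFT JOIN ttt AS t_before_next ON (
        t_before_next.uuid = t.uuid
        -- t_before_next sorts strictly between t and t_next
        AND (t_before_next.stamp > t.stamp
             OR (t_before_next.stamp = t.stamp AND t_before_next.id > t.id))
        AND (t_before_next.stamp < t_next.stamp
             OR (t_before_next.stamp = t_next.stamp AND t_before_next.id < t_next.id))
    )
WHERE
    t_prev.id IS NULL            -- nothing before t: t is the first record
    AND t_before_next.id IS NULL -- nothing between: t_next is the second record
```

When the first two rows of a uuid share a stamp, this returns a diff of 0 for that uuid instead of skipping ahead to the next distinct stamp. The OR conditions may be harder for the optimizer to serve from the index, so it is worth checking the plan with EXPLAIN on real data.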
Then you can get the AVG with this query:
-- explain
SELECT
    AVG(TIME_TO_SEC(TIMEDIFF(t_next.stamp, t.stamp))) AS avg_diff
FROM
    ttt AS t
    LEFT JOIN ttt AS t_prev ON (
        t_prev.uuid = t.uuid
        AND t_prev.stamp < t.stamp
    )
    INNER JOIN ttt AS t_next ON (
        t_next.uuid = t.uuid
        AND t_next.stamp > t.stamp
    )
    LEFT JOIN ttt AS t_before_next ON (
        t_before_next.uuid = t.uuid
        AND t_before_next.stamp > t.stamp
        AND t_before_next.stamp < t_next.stamp
    )
WHERE
    t_prev.id IS NULL
    AND t_before_next.id IS NULL
=> 1436.8000 (for your data set)
EXPLAIN output with a composite index on (uuid, stamp):
id  select_type  table          partitions  type   possible_keys  key            key_len  ref          rows  filtered  Extra
1   SIMPLE       t              NULL        index  ix_uuid_stamp  ix_uuid_stamp  49       NULL         21    100.00    Using where; Using index
1   SIMPLE       t_prev         NULL        ref    ix_uuid_stamp  ix_uuid_stamp  43       test.t.uuid  2     10.00     Using where; Not exists; Using index
1   SIMPLE       t_next         NULL        ref    ix_uuid_stamp  ix_uuid_stamp  43       test.t.uuid  2     33.33     Using where; Using index
1   SIMPLE       t_before_next  NULL        ref    ix_uuid_stamp  ix_uuid_stamp  43       test.t.uuid  2     10.00     Using where; Not exists; Using index
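Here "ref" access is used instead of the "dependent subquery" of the accepted answer.
Which is better depends on your data.
If the filtered data set (for example, when you filter records by day) is small, the "dependent subquery" will be faster. On a large filtered data set I would prefer "ref".
Feel free to test both ways and let us know which one is faster.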