SQL average time between first and second row in a set

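Ouch! My ears hurt... I thought I had it a few times, but epic fail :(

I have the following data: millions of rows, indexed, MySQL 5.6.

The table holds several sets of data, and the uuid is essentially the unique id of each set.

I need the average time between the first and second row of each set. In other words: how much time passed between the first insert, which creates the set, and the second insert for the same set, and then the average of those results.

Getting the average is no problem. What I can't figure out is how to get the time difference between the first and second row in each set.

I won't embarrass myself by pasting my broken SQL with its failed attempts at sub-queries and LIMIT. Suffice it to say, this one escapes me.

Any help is appreciated, the beers are on me :/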

+------+-----------------------------------------+----------------------------+
| id   | uuid                                    | stamp                      |
+------+-----------------------------------------+----------------------------+
|  707 | 60b5-d062-5829-c11d-5b71-5d85-075b-a3c5 | 2020-01-01 17:00:28.000000 |
|  708 | 60b5-d062-5829-c11d-5b71-5d85-075b-a3c5 | 2020-01-01 17:01:30.000000 |
|  709 | 0ccf-94e0-ce72-8092-1975-5bea-6131-c719 | 2020-01-02 14:11:48.000000 |
|  710 | 59c8-60ee-d172-511a-a477-c637-6789-f14a | 2020-01-02 14:23:36.000000 |
|  711 | b33b-7584-1fed-e138-28ba-c24a-9b46-88e7 | 2020-01-02 14:24:07.000000 |
|  712 | eddc-b12a-5ef2-baea-cf53-7287-5805-d922 | 2020-01-02 14:24:26.000000 |
|  713 | 257b-fc66-6d7a-ba21-727e-1da7-0ee1-714c | 2020-01-02 14:25:31.000000 |
|  718 | c5d9-acba-9a12-aacb-cf45-c5a9-2b8d-314c | 2020-01-02 15:46:41.000000 |
|  719 | 0ccf-94e0-ce72-8092-1975-5bea-6131-c719 | 2020-01-02 15:55:42.000000 |
|  720 | c5d9-acba-9a12-aacb-cf45-c5a9-2b8d-314c | 2020-01-02 15:56:33.000000 |
|  722 | c5d9-acba-9a12-aacb-cf45-c5a9-2b8d-314c | 2020-01-02 16:16:14.000000 |
|  723 | c5d9-acba-9a12-aacb-cf45-c5a9-2b8d-314c | 2020-01-02 16:21:25.000000 |
|  726 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 17:16:33.000000 |
|  727 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 17:21:20.000000 |
|  728 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 17:45:07.000000 |
|  729 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 17:50:17.000000 |
|  730 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 18:14:02.000000 |
|  731 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 18:27:48.000000 |
|  732 | 6610-a9df-358d-0065-beb8-cea1-82a6-3258 | 2020-01-02 18:28:57.000000 |
|  733 | c193-a46f-1104-3ee3-7387-94a8-ef32-a85e | 2020-01-02 18:40:40.000000 |
|  734 | c193-a46f-1104-3ee3-7387-94a8-ef32-a85e | 2020-01-02 18:40:49.000000 |
+------+-----------------------------------------+----------------------------+

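If a uuid only ever appeared twice, this would be trivial. You have millions of rows, so let's try to avoid sorting and assume you have the right indexes.

Here is one way to get the two earliest rows of each uuid: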

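select t.*
from t
-- keep rows no later than the second-earliest stamp of the same uuid
where t.stamp <= (select t2.stamp
                  from t t2
                  where t2.uuid = t.uuid
                  order by t2.stamp asc
                  limit 1,1  -- skip 1 row, return 1: the second-earliest stamp
                 );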

Very important: you need an index on (uuid, stamp) to have any hope of decent performance.
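As a minimal sketch, the index could be created as follows. The index name ix_uuid_stamp is an assumption for illustration; the table name t is the placeholder used in the queries of this answer, so substitute your own.

```sql
-- Composite index: uuid first (the equality match in the correlated subquery),
-- stamp second (the ORDER BY inside that subquery).
-- ix_uuid_stamp is an assumed name; t is the placeholder table name.
create index ix_uuid_stamp on t (uuid, stamp);

-- Optional check that the index now exists on the table.
show index from t;
```

With uuid as the leading column and stamp as the second, the subquery can in principle read the rows of one uuid straight from the index in stamp order, which is what lets it avoid a sort.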

Then, just aggregate:

select uuid, timestampdiff(second, min(stamp), max(stamp)) as diff_seconds
from (select t.*
      from t
      where t.stamp <= (select t2.stamp
                        from t t2
                        where t2.uuid = t.uuid
                        order by t2.stamp asc
                        limit 1,1
                       )
     ) t
group by uuid;
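The question asks for one overall average rather than one difference per uuid. A possible last step, sketched here under the same assumptions (table t, columns uuid and stamp), is to wrap the grouped query in an outer AVG. The aliases diff_seconds, per_uuid, and avg_diff are made up for this example.

```sql
-- Average, over all uuids with at least two rows, of the gap in seconds
-- between the earliest and the second-earliest stamp.
select avg(diff_seconds) as avg_diff
from (select t.uuid,
             timestampdiff(second, min(t.stamp), max(t.stamp)) as diff_seconds
      from t
      where t.stamp <= (select t2.stamp
                        from t t2
                        where t2.uuid = t.uuid
                        order by t2.stamp asc
                        limit 1,1
                       )
      group by t.uuid
     ) per_uuid;
```

A uuid with a single row drops out on its own: its subquery returns NULL, the comparison is never true, and so it contributes nothing to the average.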

Another approach is to use LEFT JOIN instead of a sub-query.

SELECT
      t.uuid
    , t.stamp AS t_stamp
    , t_next.stamp AS t_next_stamp
    , TIME_TO_SEC(TIMEDIFF(t_next.stamp, t.stamp)) AS diff
FROM
    ttt AS t
    LEFT JOIN ttt AS t_prev ON (
            t_prev.uuid  = t.uuid
        AND t_prev.stamp < t.stamp
    )
    INNER JOIN ttt AS t_next ON (
            t_next.uuid  = t.uuid
        AND t_next.stamp > t.stamp
    )
    LEFT JOIN ttt AS t_before_next ON (
            t_before_next.uuid  = t.uuid
        AND t_before_next.stamp > t.stamp
        AND t_before_next.stamp < t_next.stamp 
    )
WHERE
        t_prev.id IS NULL -- no t_prev so t is the first record
    AND t_before_next.id IS NULL -- no t_before_next so t_next is the second record
    -- filter data by your criteria, per day for example.
    -- you will need to "duplicate" filtering conditions for t_prev and t_next
ORDER BY
    uuid

=>

uuid                                     t_stamp              t_next_stamp         diff
0ccf-94e0-ce72-8092-1975-5bea-6131-c719  2020-01-02 14:11:48  2020-01-02 15:55:42  6234
60b5-d062-5829-c11d-5b71-5d85-075b-a3c5  2020-01-01 17:00:28  2020-01-01 17:01:30    62
6610-a9df-358d-0065-beb8-cea1-82a6-3258  2020-01-02 17:16:33  2020-01-02 17:21:20   287
c193-a46f-1104-3ee3-7387-94a8-ef32-a85e  2020-01-02 18:40:40  2020-01-02 18:40:49     9
c5d9-acba-9a12-aacb-cf45-c5a9-2b8d-314c  2020-01-02 15:46:41  2020-01-02 15:56:33   592

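Warning:

The query above will miss records that share the same stamp. If you need them, you must change the join conditions from

t_prev.stamp < t.stamp

to

t_prev.stamp <= t.stamp AND t_prev.id < t.id

and so on for the other joins. Note that this form assumes id increases with stamp, which typically holds for an auto-increment id when rows are inserted in time order.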
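For illustration, here is one way the tie-aware version might look in full. This sketch is not the query from the answer: it orders rows by the pair (stamp, id) and spells the comparison out with OR, so it does not depend on id following insertion order. The table name ttt and the column names are taken from the query above.

```sql
SELECT
      t.uuid
    , t.stamp AS t_stamp
    , t_next.stamp AS t_next_stamp
    , TIME_TO_SEC(TIMEDIFF(t_next.stamp, t.stamp)) AS diff
FROM
    ttt AS t
    -- t_prev: any row of the same uuid that sorts before t by (stamp, id)
    LEFT JOIN ttt AS t_prev ON (
            t_prev.uuid = t.uuid
        AND (t_prev.stamp < t.stamp
             OR (t_prev.stamp = t.stamp AND t_prev.id < t.id))
    )
    -- t_next: any row of the same uuid that sorts after t by (stamp, id)
    INNER JOIN ttt AS t_next ON (
            t_next.uuid = t.uuid
        AND (t_next.stamp > t.stamp
             OR (t_next.stamp = t.stamp AND t_next.id > t.id))
    )
    -- t_before_next: any row that sorts strictly between t and t_next
    LEFT JOIN ttt AS t_before_next ON (
            t_before_next.uuid = t.uuid
        AND (t_before_next.stamp > t.stamp
             OR (t_before_next.stamp = t.stamp AND t_before_next.id > t.id))
        AND (t_before_next.stamp < t_next.stamp
             OR (t_before_next.stamp = t_next.stamp AND t_before_next.id < t_next.id))
    )
WHERE
        t_prev.id IS NULL        -- nothing sorts before t: t is the first record
    AND t_before_next.id IS NULL -- nothing in between: t_next is the second record
ORDER BY
    t.uuid;
```

The OR conditions may make the joins harder for the optimizer to serve from the (uuid, stamp) index, so it is worth checking the plan with EXPLAIN before running this on millions of rows.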

Then you can get the AVG with this query:

-- explain
SELECT
    AVG(TIME_TO_SEC(TIMEDIFF(t_next.stamp, t.stamp))) AS avg_diff
FROM
    ttt AS t
    LEFT JOIN ttt AS t_prev ON (
            t_prev.uuid  = t.uuid
        AND t_prev.stamp < t.stamp
    )
    INNER JOIN ttt AS t_next ON (
            t_next.uuid  = t.uuid
        AND t_next.stamp > t.stamp
    )
    LEFT JOIN ttt AS t_before_next ON (
            t_before_next.uuid  = t.uuid
        AND t_before_next.stamp > t.stamp
        AND t_before_next.stamp < t_next.stamp 
    )
WHERE
        t_prev.id IS NULL
    AND t_before_next.id IS NULL

=> 1436.8000 (for your data set)
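The comment in the first LEFT JOIN query mentions filtering, for example per day, and repeating the conditions for the joined aliases. A rough sketch of what that could look like follows. The date window 2020-01-02 is hypothetical, and the result is the gap between the first and second rows within that window.

```sql
SELECT
    AVG(TIME_TO_SEC(TIMEDIFF(t_next.stamp, t.stamp))) AS avg_diff
FROM
    ttt AS t
    LEFT JOIN ttt AS t_prev ON (
            t_prev.uuid  = t.uuid
        AND t_prev.stamp < t.stamp
        AND t_prev.stamp >= '2020-01-02' -- lower bound repeated in ON, not in WHERE
    )
    INNER JOIN ttt AS t_next ON (
            t_next.uuid  = t.uuid
        AND t_next.stamp > t.stamp
        AND t_next.stamp < '2020-01-03'  -- upper bound repeated for t_next
    )
    LEFT JOIN ttt AS t_before_next ON (
            t_before_next.uuid  = t.uuid
        AND t_before_next.stamp > t.stamp
        AND t_before_next.stamp < t_next.stamp -- already inside the window
    )
WHERE
        t.stamp >= '2020-01-02'
    AND t.stamp <  '2020-01-03'
    AND t_prev.id IS NULL
    AND t_before_next.id IS NULL;
```

The filter on t_prev has to sit in the ON clause: moved to WHERE, it would clash with the t_prev.id IS NULL test that implements the anti-join. Separately, TIMEDIFF results are limited to the range of the TIME type (about 838 hours), so if gaps can exceed roughly 34 days, TIMESTAMPDIFF(SECOND, t.stamp, t_next.stamp) is the safer expression.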

EXPLAIN output with a composite index on (uuid, stamp):

id  select_type  table          partitions  type   possible_keys  key            key_len  ref          rows  filtered  Extra
1   SIMPLE       t              NULL        index  ix_uuid_stamp  ix_uuid_stamp  49       NULL         21    100.00    Using where; Using index
1   SIMPLE       t_prev         NULL        ref    ix_uuid_stamp  ix_uuid_stamp  43       test.t.uuid  2     10.00     Using where; Not exists; Using index
1   SIMPLE       t_next         NULL        ref    ix_uuid_stamp  ix_uuid_stamp  43       test.t.uuid  2     33.33     Using where; Using index
1   SIMPLE       t_before_next  NULL        ref    ix_uuid_stamp  ix_uuid_stamp  43       test.t.uuid  2     10.00     Using where; Not exists; Using index

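This plan uses "ref" access instead of the "dependent subquery" of the accepted answer. Which one is better depends on your data. If the filtered data set is small (for example, when you filter records by day), the "dependent subquery" will be faster. On a large filtered data set, I would prefer "ref".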
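To compare the two plans on your own data, one option is to run EXPLAIN on the sub-query version as well. The sketch below assumes the same table name ttt as the join queries; the first answer calls the table t. The alias diff_seconds is made up for this example.

```sql
-- Plan for the correlated sub-query version; the inner select should be
-- reported with select_type DEPENDENT SUBQUERY.
EXPLAIN
SELECT t.uuid,
       TIMESTAMPDIFF(SECOND, MIN(t.stamp), MAX(t.stamp)) AS diff_seconds
FROM ttt AS t
WHERE t.stamp <= (SELECT t2.stamp
                  FROM ttt AS t2
                  WHERE t2.uuid = t.uuid
                  ORDER BY t2.stamp ASC
                  LIMIT 1, 1)
GROUP BY t.uuid;
```

EXPLAIN only shows the estimated plan, so for a real comparison it is still worth timing both queries against a representative slice of the table.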

Feel free to test both approaches and let us know which one is faster.