Grouping Timestamps based on the interval between them

I have a table in Hive (SQL) with a bunch of timestamps that need to be grouped into separate sessions based on the time difference between consecutive timestamps.

Example: consider the following timestamps (given in HH:MM for simplicity): 9.00 9.10 9.20 9.40 9.43 10.30 10.45 11.25 12.30 12.33 and so on.

So now, all timestamps that fall within 30 minutes of the next timestamp belong to the same session, i.e. 9.00, 9.10, 9.20, 9.40 and 9.43 form one session.

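But since the difference between 9.43 and 10.30 is more than 30 minutes, the timestamp 10.30 falls into a different session. Likewise, 10.30 and 10.45 fall into one session.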

Once these sessions are created, we have to get the minimum and maximum timestamp of each session.

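I tried subtracting the current timestamp from its LEAD and setting a flag when the difference is greater than 30 minutes, but I am having trouble getting this to work.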

Any suggestions would be greatly appreciated. Let me know if the question is not clear enough.

Expected output for this sample data:

Session_start   Session_end
9.00                9.43
10.30               10.45
11.25               11.25 (same because the next time is not within 30 mins)
12.30               12.33

Hope this helps.

Try this:

SELECT DATE_FORMAT(MIN(STR_TO_DATE(B.column1, '%H.%i')), '%H.%i') AS Session_start, 
       DATE_FORMAT(MAX(STR_TO_DATE(B.column1, '%H.%i')), '%H.%i') AS Session_end
FROM tableA A
LEFT JOIN ( SELECT A.column1, diff, IF(@diff:=diff < 30, @id, @id:=@id+1) AS rnk
            FROM (SELECT B.column1, TIME_TO_SEC(TIMEDIFF(STR_TO_DATE(B.column1, '%H.%i'), STR_TO_DATE(A.column1, '%H.%i'))) / 60 AS diff
                  FROM tableA A
                  INNER JOIN tableA B ON STR_TO_DATE(A.column1, '%H.%i') < STR_TO_DATE(B.column1, '%H.%i') 
                  GROUP BY STR_TO_DATE(A.column1, '%H.%i')
                 ) AS A, (SELECT @diff:=0, @id:= 1) AS B
           ) AS B ON A.column1 = B.column1
GROUP BY IFNULL(B.rnk, 1);

Check the SQL Fiddle demo.

Output

| SESSION_START | SESSION_END |
|---------------|-------------|
|          9.00 |        9.43 |
|         10.30 |       10.45 |
|         11.25 |       11.25 |
|         12.30 |       12.33 |

Try this:

SELECT MIN(session_time_tmp) session_start, MAX(session_time_tmp) session_end FROM 
(
SELECT  IF((TIME_TO_SEC(TIMEDIFF(your_time_field, COALESCE(@previousValue, your_time_field))) / 60) > 30 , 
        @sessionCount := @sessionCount + 1, @sessionCount ) sessCount, 
        ( @previousValue := your_time_field ) session_time_tmp  FROM 
(
SELECT your_time_field, @previousValue:= NULL, @sessionCount := 1 FROM yourtable ORDER BY your_time_field
) a
) b
GROUP BY sessCount

Just replace yourtable and your_time_field with your own table and column names.

Since MySQL (before version 8.0) lacks the LAG and LEAD functions, getting the previous or next record already takes some work. Here is how:

select 
  thetime, 
  (select max(thetime) from mytable afore where afore.thetime < mytable.thetime) as afore_time,
  (select min(thetime) from mytable after where after.thetime > mytable.thetime) as after_time
from mytable;

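Based on this we can build the whole query, looking for gaps (i.e. a time difference to the previous or next record of more than 30 minutes = 1800 seconds).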

select
  startrec.thetime as start_time,
  (
    select min(endrec.thetime) 
    from 
    (
      select 
        thetime, 
        coalesce(time_to_sec(timediff((select min(thetime) from mytable after where after.thetime > mytable.thetime), thetime)), 1801) > 1800 as gap
      from mytable
    ) endrec
    where gap
    and endrec.thetime >= startrec.thetime
  ) as end_time
from
(
  select 
    thetime, 
    coalesce(time_to_sec(timediff(thetime, (select max(thetime) from mytable afore where afore.thetime < mytable.thetime))), 1801) > 1800 as gap
  from mytable
) startrec
where gap;

SQL fiddle: http://www.sqlfiddle.com/#!2/d307b/20.
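
The start/end pairing that the query above relies on can be sketched outside SQL as well. The following Python snippet is only an illustration, not part of the original answer: the function names `sessions_by_boundaries` and `hhmm`, the minutes-since-midnight representation, and the sample list (the timestamps from the question) are all assumptions made for the example.

```python
from typing import List, Tuple

GAP_MINUTES = 30  # threshold taken from the question


def sessions_by_boundaries(times: List[int]) -> List[Tuple[int, int]]:
    """Pair each session start with the first session end at or after it.

    `times` holds minutes since midnight (a hypothetical representation).
    """
    ordered = sorted(times)
    # A start record has no predecessor within the threshold.
    starts = [t for i, t in enumerate(ordered)
              if i == 0 or t - ordered[i - 1] > GAP_MINUTES]
    # An end record has no successor within the threshold.
    ends = [t for i, t in enumerate(ordered)
            if i == len(ordered) - 1 or ordered[i + 1] - t > GAP_MINUTES]
    # Same idea as the correlated subquery: min(end) where end >= start.
    return [(s, min(e for e in ends if e >= s)) for s in starts]


def hhmm(minutes: int) -> str:
    """Format minutes since midnight in the H.MM style used in the question."""
    return f"{minutes // 60}.{minutes % 60:02d}"


# 9.00 9.10 9.20 9.40 9.43 10.30 10.45 11.25 12.30 12.33
sample = [540, 550, 560, 580, 583, 630, 645, 685, 750, 753]
result = [(hhmm(s), hhmm(e)) for s, e in sessions_by_boundaries(sample)]
print(result)
```

As in the SQL, the first row counts as a start and the last row counts as an end, which is what the `coalesce(..., 1801)` fallback achieves in the query.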

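So it is not MySQL, but Hive. I don't know Hive, but if it supports LAG, as you say, try this PostgreSQL query. You will probably have to change the time difference calculation, which usually differs from one DBMS to another.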

select min(thetime) as start_time, max(thetime) as end_time
from
(
  select thetime, count(gap) over (order by thetime rows between unbounded preceding and current row) as groupid
  from
  (
    select thetime, case when thetime - lag(thetime) over (order by thetime) > interval '30 minutes' then 1 end as gap
    from mytable
  ) times
) groups
group by groupid
order by min(thetime);

The query finds the gaps, then uses a running total of the gap count to build a group ID, and the rest is aggregation.

SQL fiddle: http://www.sqlfiddle.com/#!17/8bc4a/6.
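
For readers who want to check the logic before porting it to Hive, the three steps described above (flag the gaps, keep a running count, aggregate per group) can be mimicked in plain Python. This is a rough sketch rather than a Hive solution: the function name `sessions_by_running_count` and the use of minutes since midnight instead of real timestamps are assumptions made for the example.

```python
from itertools import accumulate, groupby
from typing import List, Tuple

GAP_MINUTES = 30  # threshold taken from the question


def sessions_by_running_count(times: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) per session; `times` are minutes since midnight."""
    ordered = sorted(times)
    if not ordered:
        return []
    # Step 1: flag a row with 1 when it is more than 30 minutes after the
    # previous row (the LAG comparison); the first row gets 0.
    gaps = [0] + [1 if cur - prev > GAP_MINUTES else 0
                  for prev, cur in zip(ordered, ordered[1:])]
    # Step 2: the running total of the flags is the group id.
    group_ids = list(accumulate(gaps))
    # Step 3: aggregate min and max per group id.
    sessions = []
    for _, rows in groupby(zip(group_ids, ordered), key=lambda pair: pair[0]):
        members = [t for _, t in rows]
        sessions.append((members[0], members[-1]))
    return sessions


# 9.00 9.10 9.20 9.40 9.43 10.30 10.45 11.25 12.30 12.33
sample = [540, 550, 560, 580, 583, 630, 645, 685, 750, 753]
sessions = sessions_by_running_count(sample)
print(sessions)
```

Because the rows are sorted first, each group is contiguous, so the first and last member of a group are its minimum and maximum.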