Comparing timestamps in two consecutive rows which have different values for column A and the same value for column B in BigQuery

Question

Folks, I have a BigQuery result which shows the time (in column local_time) when riders (in column rider_id) log out of an app (in column event). The event column has two distinct values, "authentication_complete" and "logout".

event_date  rider_id    event                    local_time
20200329    100695      authentication_complete  20:07:09
20200329    100884      authentication_complete  12:00:51
20200329    100967      logout                   10:53:17
20200329    100967      authentication_complete  10:55:24
20200329    100967      logout                   11:03:28
20200329    100967      authentication_complete  11:03:47
20200329    101252      authentication_complete  7:55:21
20200329    101940      authentication_complete  8:58:44
20200329    101940      authentication_complete  17:19:57
20200329    102015      authentication_complete  14:20:27
20200329    102015      logout                   22:47:50
20200329    102015      authentication_complete  22:48:34

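What I want to achieve is this: for each rider who logged out, I want one column with the time of the logout and another column with the time of the "authentication_complete" event that came right after that logout for the same rider. That way I can see how long each rider was out of the app. The query result I want looks like this.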

event_date  rider_id    time_of_logout  authentication_complete_right_after_the_logout
20200329    100967      10:53:17        10:55:24
20200329    100967      11:03:28        11:03:47
20200329    102015      22:47:50        22:48:34

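This is a very unclean dataset. I have managed to clean it this far, but at this step I am stuck. I have been looking at functions like lag(), but the data has 180,000 rows, a rider_id can have several "logout" events, and the same rider_id can have several consecutive "authentication_complete" events, which makes it more confusing. I would really appreciate any help. Thanks!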

Answer 1

I think you want lead():

-- For each row, look up the time of the same rider's next event on the same day,
-- then keep only the logout rows.
select event_date, rider_id, local_time as logout_date,
       authentication_date
from (select t.*,
             lead(local_time) over (partition by event_date, rider_id order by local_time) as authentication_date
      from t
     ) t
where event = 'logout';

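This assumes that the next event really is an authentication, as in your sample data. You have not specified what should happen if that is not the case.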
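
One way to check that assumption is to return the type of the next event alongside its time. The sketch below runs the lead() query on a few rows from the question, with lead(event) added as a next_event column. It uses Python's sqlite3 module only so that the example runs on its own; the in-memory database, the zero-padded text times, the named window w, and the next_event column are assumptions made for illustration, and the query was not run on BigQuery here.

```python
import sqlite3

# A subset of the sample rows from the question. Times are stored as
# zero-padded text (an assumption) so that text order equals time order.
ROWS = [
    ("20200329", 100695, "authentication_complete", "20:07:09"),
    ("20200329", 100967, "logout", "10:53:17"),
    ("20200329", 100967, "authentication_complete", "10:55:24"),
    ("20200329", 100967, "logout", "11:03:28"),
    ("20200329", 100967, "authentication_complete", "11:03:47"),
    ("20200329", 101940, "authentication_complete", "08:58:44"),
    ("20200329", 101940, "authentication_complete", "17:19:57"),
    ("20200329", 102015, "authentication_complete", "14:20:27"),
    ("20200329", 102015, "logout", "22:47:50"),
    ("20200329", 102015, "authentication_complete", "22:48:34"),
]

# Same shape as the lead() query, plus lead(event) so the caller can see
# whether the event after each logout is really an authentication.
LEAD_QUERY = """
select event_date, rider_id, local_time as logout_date,
       authentication_date, next_event
from (select t.*,
             lead(local_time) over w as authentication_date,
             lead(event) over w as next_event
      from t
      window w as (partition by event_date, rider_id order by local_time)
     ) t
where event = 'logout'
order by rider_id, local_time
"""


def run_lead_query(rows):
    """Load rows into an in-memory table t and return the query result."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table t (event_date text, rider_id integer, "
        "event text, local_time text)"
    )
    conn.executemany("insert into t values (?, ?, ?, ?)", rows)
    result = conn.execute(LEAD_QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in run_lead_query(ROWS):
        print(row)
```

Any logout row whose next_event is not 'authentication_complete' (or is null) is a case the lead() query pairs incorrectly, so it can be filtered out or inspected separately.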

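If you specifically want the next authentication time, even when other events come in between, then you can use min(). Because the window is ordered by local_time descending, the default frame covers the current row and every later row in the partition, so min() returns the earliest authentication_complete time at or after the logout: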

-- For each row, take the earliest authentication_complete time among the
-- current row and all later rows of the same rider on the same day.
select event_date, rider_id, local_time as logout_date,
       authentication_date
from (select t.*,
             min(case when event = 'authentication_complete' then local_time end) over (partition by event_date, rider_id order by local_time desc) as authentication_date
      from t
     ) t
where event = 'logout';
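
To see how the min() version behaves on messier data, here is a minimal sketch that runs it on a made-up rider with two consecutive logouts and a final logout that has no later authentication. As assumptions for illustration: rider 999001 and its rows are invented, times are zero-padded text, Python's sqlite3 stands in for BigQuery so the example runs on its own, and the seconds_away helper is a hypothetical addition for computing how long the rider was away.

```python
import sqlite3
from datetime import datetime

# Invented rows (not from the question): two logouts in a row, two
# authentications in a row, and a last logout with nothing after it.
MESSY_ROWS = [
    ("20200329", 999001, "logout", "09:00:00"),
    ("20200329", 999001, "logout", "09:05:00"),
    ("20200329", 999001, "authentication_complete", "09:10:00"),
    ("20200329", 999001, "authentication_complete", "09:12:00"),
    ("20200329", 999001, "logout", "18:00:00"),
]

# The min() query: with a descending order, the default window frame
# spans the current row and all later times in the partition.
MIN_QUERY = """
select event_date, rider_id, local_time as logout_date,
       authentication_date
from (select t.*,
             min(case when event = 'authentication_complete'
                      then local_time end)
               over (partition by event_date, rider_id
                     order by local_time desc) as authentication_date
      from t
     ) t
where event = 'logout'
order by rider_id, local_time
"""


def run_min_query(rows):
    """Load rows into an in-memory table t and return the query result."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table t (event_date text, rider_id integer, "
        "event text, local_time text)"
    )
    conn.executemany("insert into t values (?, ?, ?, ?)", rows)
    result = conn.execute(MIN_QUERY).fetchall()
    conn.close()
    return result


def seconds_away(logout_time, login_time):
    """Seconds between logout and next login, or None if no login followed."""
    if login_time is None:
        return None
    fmt = "%H:%M:%S"
    delta = datetime.strptime(login_time, fmt) - datetime.strptime(logout_time, fmt)
    return int(delta.total_seconds())


if __name__ == "__main__":
    for event_date, rider_id, logout, login in run_min_query(MESSY_ROWS):
        print(event_date, rider_id, logout, login, seconds_away(logout, login))
```

In this sketch both early logouts are paired with the 09:10:00 authentication, and the 18:00:00 logout gets a null because nothing follows it within the same event_date partition. A login on the following day would not be matched either, since the partition includes event_date.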

sql

google-bigquery

data-cleaning

data-wrangling