Modifying the id value of a row based on interval time threshold in date time column

Question

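I am working with the GeoLife dataset, which contains timestamped GPS trajectories of users in text files (.plt). Each text file contains the GPS points of a single trip by one user. I imported the dataset into Postgres using a Python script.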

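Because each file is named with a string of digits derived from the trip's start time (for example, the file containing the trip in the table below is 20070920074804.plt), I use the file name (without the extension) as the trip id (session_id). Here is the raw GPS data in table trajectories: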

 user_id |    session_id     |       timestamp        |    lat    |    lon     | alt 
---------+-------------------+------------------------+-----------+------------+-----
      11 |    20070920074804 | 2007-09-20 07:48:04+01 |  28.19737 | 113.006795 |  71
      11 |    20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 |  87
      11 |    20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 |  113.00679 |  87
      11 |    20070920074804 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 |  62
      11 |    20070920074804 | 2007-09-20 14:04:59+01 | 28.197108 |  113.00734 |  62
      11 |    20070920074804 | 2007-09-20 14:05:01+01 | 28.197088 |  113.00727 |  62

For analysis purposes, I created another table, trips_metrics, where I compute trip metrics from the trajectories table and insert the results. The values I compute include trip distance (haversine) and duration (end time minus start time).
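For reference, the two metrics mentioned above can be sketched in Python as follows. This is only an illustration, not the asker's actual script: the function names `haversine_m` and `trip_metrics`, the Earth radius constant, and the use of naive timestamps are all assumptions.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def trip_metrics(points):
    """points: non-empty list of (timestamp, lat, lon), sorted by timestamp.

    Returns (distance in metres summed over consecutive points, duration).
    """
    distance = sum(
        haversine_m(a[1], a[2], b[1], b[2]) for a, b in zip(points, points[1:])
    )
    duration = points[-1][0] - points[0][0]  # last timestamp minus first
    return distance, duration


# The six sample rows from the trajectories table (time zone dropped).
points = [
    (datetime(2007, 9, 20, 7, 48, 4), 28.19737, 113.006795),
    (datetime(2007, 9, 20, 8, 7, 9), 28.197685, 113.006792),
    (datetime(2007, 9, 20, 8, 7, 10), 28.197685, 113.00679),
    (datetime(2007, 9, 20, 14, 3, 50), 28.197342, 113.007422),
    (datetime(2007, 9, 20, 14, 4, 59), 28.197108, 113.00734),
    (datetime(2007, 9, 20, 14, 5, 1), 28.197088, 113.00727),
]
distance, duration = trip_metrics(points)
print(round(distance), duration)
```

Computed over the whole file, the duration spans more than six hours even though the points lie within a few hundred metres of each other, which is the kind of mismatch described next.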

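Then I noticed something odd: one user had a trip lasting 8 hours that covered only 321 m. After going through the trip file carefully, I noticed jumps in the timestamps, which indicate that the trip was interrupted (perhaps the user stopped for several hours and then continued). Rows 3 and 4 in the table above are an example.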

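To get accurate trip durations, I need to split trips at these points: if the time gap between consecutive rows exceeds 30 minutes, the rows after the gap should be treated as a new trip (and therefore get a new ID).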
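The 30-minute rule can be illustrated with a small Python sketch. The function name `gap_flags` is hypothetical, and the timestamps are assumed to be sorted and to belong to a single session.

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=30)  # threshold taken from the question


def gap_flags(timestamps, threshold=GAP):
    """Return one 0/1 flag per timestamp.

    A flag of 1 marks the first row after a pause longer than `threshold`,
    i.e. the row where a new trip should start.
    """
    flags = []
    prev = None
    for ts in timestamps:
        flags.append(1 if prev is not None and ts - prev > threshold else 0)
        prev = ts
    return flags


# Timestamps of the six sample rows (time zone dropped).
timestamps = [
    datetime(2007, 9, 20, 7, 48, 4),
    datetime(2007, 9, 20, 8, 7, 9),
    datetime(2007, 9, 20, 8, 7, 10),
    datetime(2007, 9, 20, 14, 3, 50),
    datetime(2007, 9, 20, 14, 4, 59),
    datetime(2007, 9, 20, 14, 5, 1),
]
flags = gap_flags(timestamps)
print(flags)
```

Only the fourth row is flagged: the 19-minute pause between the first two rows stays under the threshold, while the jump from 08:07:10 to 14:03:50 does not.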

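I intend to do this before actually computing the trip metrics (i.e., by modifying the trajectories table). So for the example in the table above, I would like to split it like this: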

 user_id |    session_id     |       timestamp        |    lat    |    lon     | alt 
---------+-------------------+------------------------+-----------+------------+-----
      11 |  20070920074804   | 2007-09-20 07:48:04+01 |  28.19737 | 113.006795 |  71
      11 |  20070920074804   | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 |  87
      11 |  20070920074804   | 2007-09-20 08:07:10+01 | 28.197685 |  113.00679 |  87
      11 |  2007092007480402 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 |  62
      11 |  2007092007480402 | 2007-09-20 14:04:59+01 | 28.197108 |  113.00734 |  62
      11 |  2007092007480402 | 2007-09-20 14:05:01+01 | 28.197088 |  113.00727 |  62

Note how I assigned a new session_id to the new trip (because the time gap in between is more than 30 minutes).

How can I make this modification to the raw GPS table (trajectories) in Postgres?

EDIT

A: The first query in @GMB's answer runs, but it gives every row its own new session_id in the new_session_id column.

+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
| user_id |   session_id   |       timestamp        |    lat    |    lon     | alt | is_gap |  new_session_id  |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
|      11 | 20070920074804 | 2007-09-20 07:48:04+01 | 28.19737  | 113.006795 |  71 |        |   20070920074804 |
|      11 | 20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 |  87 |      1 | 2007092007480401 |
|      11 | 20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 | 113.00679  |  87 |      1 | 2007092007480402 |
|      11 | 20070920074804 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 |  62 |      1 | 2007092007480403 |
|      11 | 20070920074804 | 2007-09-20 14:04:59+01 | 28.197108 | 113.00734  |  62 |      1 | 2007092007480404 |
|      11 | 20070920074804 | 2007-09-20 14:05:01+01 | 28.197088 | 113.00727  |  62 |      1 | 2007092007480405 |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+

Expected result:

+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
| user_id |   session_id   |       timestamp        |    lat    |    lon     | alt | is_gap |  new_session_id  |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
|      11 | 20070920074804 | 2007-09-20 07:48:04+01 | 28.19737  | 113.006795 |  71 |        |   20070920074804 |
|      11 | 20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 |  87 |        |   20070920074804 |
|      11 | 20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 | 113.00679  |  87 |      1 | 2007092007480401 |
|      11 | 20070920074804 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 |  62 |      1 | 2007092007480401 |
|      11 | 20070920074804 | 2007-09-20 14:04:59+01 | 28.197108 | 113.00734  |  62 |      1 | 2007092007480401 |
|      11 | 20070920074804 | 2007-09-20 14:05:01+01 | 28.197088 | 113.00727  |  62 |      1 | 2007092007480401 |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+

The idea is to give an "emerging" trip a new ID formed as old_session_id + 01. If another emerging trip is encountered, it should be assigned old_session_id + 02, and so on.

B: The second query, the update version, contains a syntax error:

update trajectories t
from (
    select 
        t.*,
        case when sum(is_gap) over(partition by session_id order by timestamp) > 0
            then session_id * 100 + sum(is_gap) over(partition by session_id order by timestamp)
            else session_id
        end new_session_id
    from (
        select
            t.*,
            (timestamp > lag(timestamp) over(partition by session_id order by timestamp))::int is_gap
        from trajectories t
    ) t
) t1
set session_id = t1.new_session_id
where t1.session_id = t.session_id and t1.timestamp = t.timestamp

ERROR:  syntax error at or near "from"
LINE 2: from (

Answer 1

You can use lag() and a cumulative sum to identify the segments, and then derive the new session_id from the segment number:

select (case when grp >= 1 then session_id * 100 + grp
             else session_id
        end) as new_session_id,
       t.*
from (select t.*,
             count(*) filter (where prev_ts < timestamp - interval '30 minute') over (partition by session_id order by timestamp) as grp
      from (select t.*, 
                   lag(timestamp) over (partition by session_id order by timestamp) as prev_ts
            from trajectories t
           ) t
     ) t;

Here is a db<>fiddle.

Answer 2

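This is a gaps-and-islands problem. You want to detect consecutive rows whose timestamps differ by more than 30 minutes, and then change the session_id accordingly.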

One option is to use lag() and then compute a cumulative count of the gaps; you can then use that count to compute the new session_id:

select 
    t.*,
    case when sum(is_gap) over(partition by session_id order by timestamp) > 0
        then session_id * 100 + sum(is_gap) over(partition by session_id order by timestamp)
        else session_id
    end new_session_id
from (
    select
        t.*,
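        -- 1 when more than 30 minutes have passed since the previous row of the session
        (timestamp > lag(timestamp) over(partition by session_id order by timestamp) + interval '30 minute')::int is_gap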
    from trajectories t
) t
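To make the arithmetic of this approach easier to follow, here is a rough Python equivalent of the lag-plus-running-sum idea. It is only a sketch under stated assumptions: session_id is an integer, the timestamps are sorted and belong to one session, and the function name `new_session_ids` is made up for illustration.

```python
from datetime import datetime, timedelta
from itertools import accumulate


def new_session_ids(session_id, timestamps, threshold=timedelta(minutes=30)):
    """Return one session id per row, mirroring the query's CASE expression."""
    if not timestamps:
        return []
    # is_gap: 1 when a row comes more than `threshold` after the previous row.
    # The first row has no predecessor, so it is never a gap.
    is_gap = [0] + [int(b - a > threshold) for a, b in zip(timestamps, timestamps[1:])]
    # Running count of gaps so far, like sum(is_gap) over (order by timestamp).
    running = accumulate(is_gap)
    # Rows before the first gap keep the old id; later rows get id * 100 + count.
    return [session_id * 100 + n if n > 0 else session_id for n in running]


# Timestamps of the six sample rows (time zone dropped).
timestamps = [
    datetime(2007, 9, 20, 7, 48, 4),
    datetime(2007, 9, 20, 8, 7, 9),
    datetime(2007, 9, 20, 8, 7, 10),
    datetime(2007, 9, 20, 14, 3, 50),
    datetime(2007, 9, 20, 14, 4, 59),
    datetime(2007, 9, 20, 14, 5, 1),
]
ids = new_session_ids(20070920074804, timestamps)
print(ids)
```

With a 30-minute threshold, the first three sample rows keep 20070920074804 and the last three become 2007092007480401, since the only qualifying gap is the one before the 14:03:50 row.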

If needed, you can turn this into an update statement:

update trajectories t
set session_id = t1.new_session_id
from (
    select 
        t.*,
        case when sum(is_gap) over(partition by session_id order by timestamp) > 0
            then session_id * 100 + sum(is_gap) over(partition by session_id order by timestamp)
            else session_id
        end new_session_id
    from (
        select
            t.*,
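            -- 1 when more than 30 minutes have passed since the previous row of the session
            (timestamp > lag(timestamp) over(partition by session_id order by timestamp) + interval '30 minute')::int is_gap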
        from trajectories t
    ) t
) t1
where t1.session_id = t.session_id and t1.timestamp = t.timestamp

sql

postgresql

date

window-functions

gaps-and-islands