Modifying the id value of a row based on an interval time threshold in a date time column
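I am working with the GeoLife dataset, which contains timestamped user GPS trajectories in text files (.plt). Each text file contains the GPS points of one user trip, so I imported the dataset into postgres with a Python script.
Because each file is named with a string of digits taken from the trip's start time (for example, the file containing the trip in the table below is 20070920074804.plt), I use the file name without its extension as the trip id (session_id). Here is the raw GPS data in table trajectories: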
user_id | session_id | timestamp | lat | lon | alt
---------+-------------------+------------------------+-----------+------------+-----
11 | 20070920074804 | 2007-09-20 07:48:04+01 | 28.19737 | 113.006795 | 71
11 | 20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 | 87
11 | 20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 | 113.00679 | 87
11 | 20070920074804 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 | 62
11 | 20070920074804 | 2007-09-20 14:04:59+01 | 28.197108 | 113.00734 | 62
11 | 20070920074804 | 2007-09-20 14:05:01+01 | 28.197088 | 113.00727 | 62
For analysis purposes, I created another table, trips_metrics, where I compute trip metrics from the trajectories table and insert the results. The values I compute include trip distance (haversine) and duration (end time - start time).
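For reference, metrics of this kind could be computed along the following lines. This is a minimal sketch rather than the script used for the question: the function names and the mean Earth radius of 6,371,000 m are assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def trip_distance_m(points):
    """Sum of haversine distances between consecutive (lat, lon) points."""
    return sum(haversine_m(*p, *q) for p, q in zip(points, points[1:]))


def trip_duration(timestamps):
    """Elapsed time between the first and last point of a trip (end - start)."""
    return timestamps[-1] - timestamps[0]
```

Because the duration only looks at the first and last timestamp, a long pause inside a file inflates it while adding almost nothing to the distance.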
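Then I noticed something odd: one user's trip lasted 8hrs but covered only 321m. After going through the trip files thoroughly, I noticed jumps in the trip timestamps, which indicate a break in the trip (perhaps the user stopped for several hours and then continued). Row 3 and row 4 in the table above are an example.
To get accurate trip durations, I need to split trips at these breaks: if the time gap between consecutive rows exceeds 30 minutes, the rows after the gap should be treated as a new trip (and therefore get a new ID).
I intend to do this before actually computing the trip metrics (i.e. by modifying the trajectories table).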
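The splitting rule can be stated compactly in code. The sketch below is only an illustration of the rule in plain Python, not a postgres solution; the function name segment_indices is hypothetical, and it assumes the timestamps of one session are already sorted.

```python
from datetime import datetime, timedelta


def segment_indices(timestamps, threshold=timedelta(minutes=30)):
    """For each timestamp, count the gaps longer than `threshold` seen so far.

    Rows sharing the same count belong to the same trip segment.
    """
    indices = []
    gaps_seen = 0
    previous = None
    for ts in timestamps:
        if previous is not None and ts - previous > threshold:
            gaps_seen += 1  # a long pause starts a new segment
        indices.append(gaps_seen)
        previous = ts
    return indices


# The six sample rows from the table above (time zone offset omitted).
sample = [
    datetime(2007, 9, 20, 7, 48, 4),
    datetime(2007, 9, 20, 8, 7, 9),
    datetime(2007, 9, 20, 8, 7, 10),
    datetime(2007, 9, 20, 14, 3, 50),
    datetime(2007, 9, 20, 14, 4, 59),
    datetime(2007, 9, 20, 14, 5, 1),
]
print(segment_indices(sample))  # [0, 0, 0, 1, 1, 1]
```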
So for the example in the table above, I would like to split it like this:
user_id | session_id | timestamp | lat | lon | alt
---------+-------------------+------------------------+-----------+------------+-----
11 | 20070920074804 | 2007-09-20 07:48:04+01 | 28.19737 | 113.006795 | 71
11 | 20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 | 87
11 | 20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 | 113.00679 | 87
11 | 2007092007480402 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 | 62
11 | 2007092007480402 | 2007-09-20 14:04:59+01 | 28.197108 | 113.00734 | 62
11 | 2007092007480402 | 2007-09-20 14:05:01+01 | 28.197088 | 113.00727 | 62
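Notice how I assigned a new session_id to the new trip (because the gap in between is longer than 30 minutes).
How can I make this modification to the raw GPS table (trajectories) in postgres?
EDIT
A: The first query in @GMB's answer works, however, it gives every row a new session_id in the new_session_id column.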
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
| user_id | session_id | timestamp | lat | lon | alt | is_gap | new_session_id |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
| 11 | 20070920074804 | 2007-09-20 07:48:04+01 | 28.19737 | 113.006795 | 71 | | 20070920074804 |
| 11 | 20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 | 87 | 1 | 2007092007480401 |
| 11 | 20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 | 113.00679 | 87 | 1 | 2007092007480402 |
| 11 | 20070920074804 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 | 62 | 1 | 2007092007480403 |
| 11 | 20070920074804 | 2007-09-20 14:04:59+01 | 28.197108 | 113.00734 | 62 | 1 | 2007092007480404 |
| 11 | 20070920074804 | 2007-09-20 14:05:01+01 | 28.197088 | 113.00727 | 62 | 1 | 2007092007480405 |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
Expected result:
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
| user_id | session_id | timestamp | lat | lon | alt | is_gap | new_session_id |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
| 11 | 20070920074804 | 2007-09-20 07:48:04+01 | 28.19737 | 113.006795 | 71 | | 20070920074804 |
| 11 | 20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 | 87 | | 20070920074804 |
| 11 | 20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 | 113.00679 | 87 | 1 | 2007092007480401 |
| 11 | 20070920074804 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 | 62 | 1 | 2007092007480401 |
| 11 | 20070920074804 | 2007-09-20 14:04:59+01 | 28.197108 | 113.00734 | 62 | 1 | 2007092007480401 |
| 11 | 20070920074804 | 2007-09-20 14:05:01+01 | 28.197088 | 113.00727 | 62 | 1 | 2007092007480401 |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
The idea is to give the "emerging" trip a new ID of old_session_id + 01. If another emerging trip is encountered, it should be assigned old_session_id + 02, and so on.
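Read as arithmetic on a numeric session_id, "old_session_id + 01" means appending two digits, i.e. multiplying by 100 and adding the segment number. A small sketch of that scheme follows; the function name new_session_id and the check on the segment range are assumptions added for illustration.

```python
def new_session_id(old_session_id: int, segment: int) -> int:
    """Append a two-digit segment suffix; segment 0 keeps the original id."""
    if not 0 <= segment <= 99:
        # Two digits can only encode 99 extra segments per original trip.
        raise ValueError("segment must be between 0 and 99")
    if segment == 0:
        return old_session_id
    return old_session_id * 100 + segment


print(new_session_id(20070920074804, 0))  # 20070920074804
print(new_session_id(20070920074804, 1))  # 2007092007480401
print(new_session_id(20070920074804, 2))  # 2007092007480402
```

The widened ids still fit in a 64-bit integer, but a file with more than 99 long pauses would need a wider suffix.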
B: The second query, with the update option, contains a syntax error:
update trajectories t
from (
select
t.*,
case when sum(is_gap) over(partition by session_id order by timestamp) > 0
then session_id * 100 + sum(is_gap) over(partition by session_id order by timestamp)
else session_id
end new_session_id
from (
select
t.*,
(timestamp > lag(timestamp) over(partition by session_id order by timestamp))::int is_gap
from trajectories t
) t
) t1
set session_id = t1.new_session_id
where t1.session_id = t.session_id and t1.timestamp = t.timestamp
ERROR: syntax error at or near "from"
LINE 2: from (
You can use lag() and a cumulative sum to identify the segments, and then some way of managing the session_id:
select (case when grp >= 1 then session_id * 100 + grp
             else session_id
        end) as new_session_id,
       t.*
from (select t.*,
             -- running count of gaps longer than 30 minutes seen so far in this session
             count(*) filter (where prev_ts < timestamp - interval '30 minute') over (partition by session_id order by timestamp) as grp
      from (select t.*,
                   -- timestamp of the previous row in the same session
                   lag(timestamp) over (partition by session_id order by timestamp) as prev_ts
            from trajectories t
           ) t
     ) t;
Here is a db<>fiddle.
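This is a gaps-and-islands problem. You want to detect consecutive rows whose timestamps differ by more than 30 minutes, and then change the session_id accordingly.
One option uses lag(), then computes a cumulative count of the gaps. You can then use that information to compute the new session_id: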
select
    t.*,
    case when sum(is_gap) over(partition by session_id order by timestamp) > 0
        then session_id * 100 + sum(is_gap) over(partition by session_id order by timestamp)
        else session_id
    end new_session_id
from (
    select
        t.*,
        -- 1 when this row is more than 30 minutes after the previous row of the session
        (timestamp > lag(timestamp) over(partition by session_id order by timestamp) + interval '30 minute')::int is_gap
    from trajectories t
) t
If needed, you can turn this into an update statement:
update trajectories t
set session_id = t1.new_session_id
from (
    select
        t.*,
        case when sum(is_gap) over(partition by session_id order by timestamp) > 0
            then session_id * 100 + sum(is_gap) over(partition by session_id order by timestamp)
            else session_id
        end new_session_id
    from (
        select
            t.*,
            -- 1 when this row is more than 30 minutes after the previous row of the session
            (timestamp > lag(timestamp) over(partition by session_id order by timestamp) + interval '30 minute')::int is_gap
        from trajectories t
    ) t
) t1
where t1.session_id = t.session_id and t1.timestamp = t.timestamp
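One way to sanity-check gap logic of this kind before running an update against the real table is to replay the six sample rows in an in-memory database. The sketch below uses Python's sqlite3 module as a stand-in for postgres, which is an assumption made only to keep the example self-contained. It also assumes a SQLite build with window functions (3.25 or later), stores timestamps as text in a column named ts without the +01 offset, and writes the 30-minute threshold as 1800 seconds.

```python
import sqlite3

rows = [
    (11, 20070920074804, "2007-09-20 07:48:04"),
    (11, 20070920074804, "2007-09-20 08:07:09"),
    (11, 20070920074804, "2007-09-20 08:07:10"),
    (11, 20070920074804, "2007-09-20 14:03:50"),
    (11, 20070920074804, "2007-09-20 14:04:59"),
    (11, 20070920074804, "2007-09-20 14:05:01"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table trajectories (user_id integer, session_id integer, ts text)")
conn.executemany("insert into trajectories values (?, ?, ?)", rows)

query = """
select ts,
       case when gaps > 0 then session_id * 100 + gaps else session_id end as new_session_id
from (
    select t.*,
           -- running count of long gaps within the session
           sum(is_gap) over (partition by session_id order by ts) as gaps
    from (
        select t.*,
               -- 1 when more than 1800 s have passed since the previous row
               (strftime('%s', ts)
                - strftime('%s', lag(ts) over (partition by session_id order by ts))
                > 1800) as is_gap
        from trajectories t
    ) t
) t
order by ts
"""

new_ids = [new_id for _, new_id in conn.execute(query)]
print(new_ids)
```

With the sample data, the first three rows keep 20070920074804 and the last three receive 2007092007480401.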