在 R 或 PostgreSQL 中形成时空接近轨迹组
Forming groups of spatio-temporally near trajectories in R or PostgreSQL
我正在使用 R 和 PostgreSQL 进行一些轨迹分析。为了形成连续位置在时空上接近的轨迹段组,我创建了以下 table。我仍然缺少的是 group_id
列,这就是我的问题所在。
bike_id1 datetime bike_id2 near group_id
1 2016-05-28 11:00:00 2 TRUE 1
1 2016-05-28 11:00:05 2 TRUE 1
1 2016-05-28 11:00:10 2 FALSE NA
[...]
2 2016-05-28 11:00:05 3 TRUE 1
2 2016-05-28 11:00:10 3 TRUE 1
这是每个轨迹相互之间的多重比较结果(所有组合没有重复)和 datetime
上的内部连接(始终以 5 秒的倍数采样)。它表明对于某些位置,自行车 1 和 2 是同时采样的并且在空间上很近(某个任意阈值)。
现在我想为两辆自行车在时空上接近的路段提供唯一 ID (group_id
)。 这就是我被困的地方:我希望 group_id
尊重具有多个轨迹的群体。分配 group_id
的方法应该认识到,如果自行车 1 和 2 在 2016-05-28 11:00:05
处属于一组,那么如果 3 在同一时间戳接近 2(2016-05-28 11:00:05
).
R 或 PostgreSQL 中是否有可以帮助我完成此任务的工具? 运行 通过 table 循环似乎是错误的方法。
编辑:
正如@wildplasser 指出的那样,这似乎是一个传统上使用 SQL 解决的间隙和孤岛问题。他亲切地提供了一些示例数据,我稍微扩展了这些数据并将包含在问题中。
CREATE TABLE nearness
-- ( seq SERIAL NOT NULL UNIQUE -- surrogate for conveniance
( bike1 INTEGER NOT NULL
, bike2 INTEGER NOT NULL
, stamp timestamp NOT NULL
, near boolean
, PRIMARY KEY(bike1,bike2,stamp)
);
INSERT INTO nearness( bike1,bike2,stamp,near) VALUES
(1,2, '2016-05-28 11:00:00', TRUE)
,(1,2, '2016-05-28 11:00:05', TRUE)
,(1,2, '2016-05-28 11:00:10', TRUE)
,(1,2, '2016-05-28 11:00:20', TRUE) -- <<-- gap here
,(1,2, '2016-05-28 11:00:25', TRUE)
,(1,2, '2016-05-28 11:00:30', FALSE)
,(4,5, '2016-05-28 11:00:00', FALSE)
,(4,5, '2016-05-28 11:00:05', FALSE)
,(4,5, '2016-05-28 11:00:10', TRUE)
,(4,5, '2016-05-28 11:00:15', TRUE)
,(4,5, '2016-05-28 11:00:20', TRUE)
,(2,3, '2016-05-28 11:00:05', TRUE) -- <<-- bike 1, 2, 3 are in one grp @ 11:00:05
,(2,3, '2016-05-28 11:00:10', TRUE) -- <<-- no group here
,(6,7, '2016-05-28 11:00:00', FALSE)
,(6,7, '2016-05-28 11:00:05', FALSE)
;
更新:[理解真正的问题后;-]找到自行车的等价组(set,bike_set ) 实际上是一个关系除法问题。在一组自行车中找到分段 (clust) 的开始和结束与第一次尝试基本相同。
- 集群存储在数组中:(我相信集群不会变得太大)
- 该数组是通过递归查询构建的:将与当前集群有一个共同成员的每对自行车合并到其中。
- 最后,数组包含所有 bike_id 在特定时间碰巧触手可及的内容。
- (加上一些中间行,稍后需要被
uniq
CTE 抑制)
- 其余的是时间序列中或多或少的标准间隙检测。
注意:代码信任 (bike2 > bike1)
。这是保持数组排序并因此规范化所必需的。不能保证实际内容是规范的,因为无法保证递归查询中的添加顺序。这可能需要一些额外的工作。
CREATE TABLE nearness
( bike1 INTEGER NOT NULL
, bike2 INTEGER NOT NULL
, stamp timestamp NOT NULL
, near boolean
, PRIMARY KEY(bike1,bike2,stamp)
);
INSERT INTO nearness( bike1,bike2,stamp,near) VALUES
(1,2, '2016-05-28 11:00:00', TRUE)
,(1,2, '2016-05-28 11:00:05', TRUE)
,(1,2, '2016-05-28 11:00:10', TRUE)
,(1,2, '2016-05-28 11:00:20', TRUE) -- <<-- gap here
,(1,2, '2016-05-28 11:00:25', TRUE)
,(1,2, '2016-05-28 11:00:30', FALSE) -- <<-- these False-records serve no pupose
,(4,5, '2016-05-28 11:00:00', FALSE) -- <<-- result would be the same without them
,(4,5, '2016-05-28 11:00:05', FALSE)
,(4,5, '2016-05-28 11:00:10', TRUE)
,(4,5, '2016-05-28 11:00:15', TRUE)
,(4,5, '2016-05-28 11:00:20', TRUE)
,(2,3, '2016-05-28 11:00:05', TRUE) -- <<-- bike 1, 2, 3 are in one grp @ 11:00:05
,(2,3, '2016-05-28 11:00:10', TRUE) -- <<-- no group here
,(6,7, '2016-05-28 11:00:00', FALSE)
,(6,7, '2016-05-28 11:00:05', FALSE)
;
-- Recursive union-find to glue together sets of bike_ids
-- ,occuring at the same moment.
-- Sets are represented as {ordered,unique} arrays here
WITH RECURSIVE wood AS (
WITH omg AS (
SELECT bike1 ,bike2,stamp
, row_number() OVER(ORDER BY bike1,bike2,stamp) AS seq
, ARRAY[bike1,bike2]::integer[] AS arr
FROM nearness n WHERE near = True
)
-- Find all existing combinations of bikes
SELECT o1.stamp, o1.seq
, ARRAY[o1.bike1,o1.bike2]::integer[] AS arr
FROM omg o1
UNION ALL
SELECT o2.stamp, o2.seq -- avoid duplicates inside the array
, CASE when o2.bike1 = ANY(w.arr) THEN w.arr || o2.bike2
ELSE w.arr || o2.bike1 END AS arr
FROM omg o2
JOIN wood w
ON o2.stamp = w.stamp AND o2.seq > w.seq
AND (o2.bike1 = ANY(w.arr) OR o2.bike2 = ANY(w.arr))
AND NOT (o2.bike1 = ANY(w.arr) AND o2.bike2 = ANY(w.arr))
)
, uniq AS ( -- suppress partial sets caused by the recursive union-find buildup
SELECT * FROM wood w
WHERE NOT EXISTS (SELECT * FROM wood nx
WHERE nx.stamp = w.stamp
AND nx.arr @> w.arr AND nx.arr <> w.arr -- contains but not equal
)
)
, xsets AS ( -- make unique sets of bikes
SELECT DISTINCT arr
-- , MIN(seq) AS grp
FROM uniq
GROUP BY arr
)
, sets AS ( -- enumerate the sets of bikes
SELECT arr
, row_number() OVER () AS setnum
FROM xsets
)
, drag AS ( -- Detect beginning and end of segments of consecutive observations
SELECT u.* -- within a constant set of bike_ids
-- Edge-detection begin of group
, NOT EXISTS (SELECT * FROM uniq nx
WHERE nx.arr = u.arr
AND nx.stamp < u.stamp
AND nx.stamp >= u.stamp - '5 sec'::interval
) AS is_first
-- Edge-detection end of group
, NOT EXISTS (SELECT * FROM uniq nx
WHERE nx.arr = u.arr
AND nx.stamp > u.stamp
AND nx.stamp <= u.stamp + '5 sec'::interval
) AS is_last
, row_number() OVER(ORDER BY arr,stamp) AS nseq
FROM uniq u
)
, top AS ( -- id and groupnum for the start of a group
SELECT nseq
, row_number() OVER () AS clust
FROM drag
WHERE is_first
)
, bot AS ( -- id and groupnum for the end of a group
SELECT nseq
, row_number() OVER () AS clust
FROM drag
WHERE is_last
)
SELECT w.seq as orgseq -- results, please ...
, w.stamp
, g0.clust AS clust
, row_number() OVER(www) AS rn
, s.setnum, s.arr AS bike_set
FROM drag w
JOIN sets s ON s.arr = w.arr
JOIN top g0 ON g0.nseq <= w.seq
JOIN bot g1 ON g1.nseq >= w.seq AND g1.clust = g0.clust
WINDOW www AS (PARTITION BY g1.clust ORDER BY w.stamp)
ORDER BY g1.clust, w.stamp
;
结果:
orgseq | stamp | clust | rn | setnum | bike_set
--------+---------------------+-------+----+--------+----------
1 | 2016-05-28 11:00:00 | 1 | 1 | 1 | {1,2}
4 | 2016-05-28 11:00:20 | 3 | 1 | 1 | {1,2}
5 | 2016-05-28 11:00:25 | 3 | 2 | 1 | {1,2}
6 | 2016-05-28 11:00:05 | 4 | 1 | 3 | {1,2,3}
7 | 2016-05-28 11:00:10 | 4 | 2 | 3 | {1,2,3}
8 | 2016-05-28 11:00:10 | 4 | 3 | 2 | {4,5}
(6 rows)
我正在使用 R 和 PostgreSQL 进行一些轨迹分析。为了形成连续位置在时空上接近的轨迹段组,我创建了以下 table。我仍然缺少的是 group_id
列,这就是我的问题所在。
bike_id1 datetime bike_id2 near group_id
1 2016-05-28 11:00:00 2 TRUE 1
1 2016-05-28 11:00:05 2 TRUE 1
1 2016-05-28 11:00:10 2 FALSE NA
[...]
2 2016-05-28 11:00:05 3 TRUE 1
2 2016-05-28 11:00:10 3 TRUE 1
这是每个轨迹相互之间的多重比较结果(所有组合没有重复)和 datetime
上的内部连接(始终以 5 秒的倍数采样)。它表明对于某些位置,自行车 1 和 2 是同时采样的并且在空间上很近(某个任意阈值)。
现在我想为两辆自行车在时空上接近的路段提供唯一 ID (group_id
)。 这就是我被困的地方:我希望 group_id
尊重具有多个轨迹的群体。分配 group_id
的方法应该认识到,如果自行车 1 和 2 在 2016-05-28 11:00:05
处属于一组,那么如果 3 在同一时间戳接近 2(2016-05-28 11:00:05
).
R 或 PostgreSQL 中是否有可以帮助我完成此任务的工具? 运行 通过 table 循环似乎是错误的方法。
编辑: 正如@wildplasser 指出的那样,这似乎是一个传统上使用 SQL 解决的间隙和孤岛问题。他亲切地提供了一些示例数据,我稍微扩展了这些数据并将包含在问题中。
CREATE TABLE nearness
-- ( seq SERIAL NOT NULL UNIQUE -- surrogate for conveniance
( bike1 INTEGER NOT NULL
, bike2 INTEGER NOT NULL
, stamp timestamp NOT NULL
, near boolean
, PRIMARY KEY(bike1,bike2,stamp)
);
INSERT INTO nearness( bike1,bike2,stamp,near) VALUES
(1,2, '2016-05-28 11:00:00', TRUE)
,(1,2, '2016-05-28 11:00:05', TRUE)
,(1,2, '2016-05-28 11:00:10', TRUE)
,(1,2, '2016-05-28 11:00:20', TRUE) -- <<-- gap here
,(1,2, '2016-05-28 11:00:25', TRUE)
,(1,2, '2016-05-28 11:00:30', FALSE)
,(4,5, '2016-05-28 11:00:00', FALSE)
,(4,5, '2016-05-28 11:00:05', FALSE)
,(4,5, '2016-05-28 11:00:10', TRUE)
,(4,5, '2016-05-28 11:00:15', TRUE)
,(4,5, '2016-05-28 11:00:20', TRUE)
,(2,3, '2016-05-28 11:00:05', TRUE) -- <<-- bike 1, 2, 3 are in one grp @ 11:00:05
,(2,3, '2016-05-28 11:00:10', TRUE) -- <<-- no group here
,(6,7, '2016-05-28 11:00:00', FALSE)
,(6,7, '2016-05-28 11:00:05', FALSE)
;
更新:[理解真正的问题后;-]找到自行车的等价组(set,bike_set ) 实际上是一个关系除法问题。在一组自行车中找到分段 (clust) 的开始和结束与第一次尝试基本相同。
- 集群存储在数组中:(我相信集群不会变得太大)
- 该数组是通过递归查询构建的:将与当前集群有一个共同成员的每对自行车合并到其中。
- 最后,数组包含所有 bike_id 在特定时间碰巧触手可及的内容。
- (加上一些中间行,稍后需要被
uniq
CTE 抑制) - 其余的是时间序列中或多或少的标准间隙检测。
注意:代码信任 (bike2 > bike1)
。这是保持数组排序并因此规范化所必需的。不能保证实际内容是规范的,因为无法保证递归查询中的添加顺序。这可能需要一些额外的工作。
CREATE TABLE nearness
( bike1 INTEGER NOT NULL
, bike2 INTEGER NOT NULL
, stamp timestamp NOT NULL
, near boolean
, PRIMARY KEY(bike1,bike2,stamp)
);
INSERT INTO nearness( bike1,bike2,stamp,near) VALUES
(1,2, '2016-05-28 11:00:00', TRUE)
,(1,2, '2016-05-28 11:00:05', TRUE)
,(1,2, '2016-05-28 11:00:10', TRUE)
,(1,2, '2016-05-28 11:00:20', TRUE) -- <<-- gap here
,(1,2, '2016-05-28 11:00:25', TRUE)
,(1,2, '2016-05-28 11:00:30', FALSE) -- <<-- these False-records serve no pupose
,(4,5, '2016-05-28 11:00:00', FALSE) -- <<-- result would be the same without them
,(4,5, '2016-05-28 11:00:05', FALSE)
,(4,5, '2016-05-28 11:00:10', TRUE)
,(4,5, '2016-05-28 11:00:15', TRUE)
,(4,5, '2016-05-28 11:00:20', TRUE)
,(2,3, '2016-05-28 11:00:05', TRUE) -- <<-- bike 1, 2, 3 are in one grp @ 11:00:05
,(2,3, '2016-05-28 11:00:10', TRUE) -- <<-- no group here
,(6,7, '2016-05-28 11:00:00', FALSE)
,(6,7, '2016-05-28 11:00:05', FALSE)
;
-- Recursive union-find to glue together sets of bike_ids
-- ,occuring at the same moment.
-- Sets are represented as {ordered,unique} arrays here
WITH RECURSIVE wood AS (
WITH omg AS (
SELECT bike1 ,bike2,stamp
, row_number() OVER(ORDER BY bike1,bike2,stamp) AS seq
, ARRAY[bike1,bike2]::integer[] AS arr
FROM nearness n WHERE near = True
)
-- Find all existing combinations of bikes
SELECT o1.stamp, o1.seq
, ARRAY[o1.bike1,o1.bike2]::integer[] AS arr
FROM omg o1
UNION ALL
SELECT o2.stamp, o2.seq -- avoid duplicates inside the array
, CASE when o2.bike1 = ANY(w.arr) THEN w.arr || o2.bike2
ELSE w.arr || o2.bike1 END AS arr
FROM omg o2
JOIN wood w
ON o2.stamp = w.stamp AND o2.seq > w.seq
AND (o2.bike1 = ANY(w.arr) OR o2.bike2 = ANY(w.arr))
AND NOT (o2.bike1 = ANY(w.arr) AND o2.bike2 = ANY(w.arr))
)
, uniq AS ( -- suppress partial sets caused by the recursive union-find buildup
SELECT * FROM wood w
WHERE NOT EXISTS (SELECT * FROM wood nx
WHERE nx.stamp = w.stamp
AND nx.arr @> w.arr AND nx.arr <> w.arr -- contains but not equal
)
)
, xsets AS ( -- make unique sets of bikes
SELECT DISTINCT arr
-- , MIN(seq) AS grp
FROM uniq
GROUP BY arr
)
, sets AS ( -- enumerate the sets of bikes
SELECT arr
, row_number() OVER () AS setnum
FROM xsets
)
, drag AS ( -- Detect beginning and end of segments of consecutive observations
SELECT u.* -- within a constant set of bike_ids
-- Edge-detection begin of group
, NOT EXISTS (SELECT * FROM uniq nx
WHERE nx.arr = u.arr
AND nx.stamp < u.stamp
AND nx.stamp >= u.stamp - '5 sec'::interval
) AS is_first
-- Edge-detection end of group
, NOT EXISTS (SELECT * FROM uniq nx
WHERE nx.arr = u.arr
AND nx.stamp > u.stamp
AND nx.stamp <= u.stamp + '5 sec'::interval
) AS is_last
, row_number() OVER(ORDER BY arr,stamp) AS nseq
FROM uniq u
)
, top AS ( -- id and groupnum for the start of a group
SELECT nseq
, row_number() OVER () AS clust
FROM drag
WHERE is_first
)
, bot AS ( -- id and groupnum for the end of a group
SELECT nseq
, row_number() OVER () AS clust
FROM drag
WHERE is_last
)
SELECT w.seq as orgseq -- results, please ...
, w.stamp
, g0.clust AS clust
, row_number() OVER(www) AS rn
, s.setnum, s.arr AS bike_set
FROM drag w
JOIN sets s ON s.arr = w.arr
JOIN top g0 ON g0.nseq <= w.seq
JOIN bot g1 ON g1.nseq >= w.seq AND g1.clust = g0.clust
WINDOW www AS (PARTITION BY g1.clust ORDER BY w.stamp)
ORDER BY g1.clust, w.stamp
;
结果:
orgseq | stamp | clust | rn | setnum | bike_set
--------+---------------------+-------+----+--------+----------
1 | 2016-05-28 11:00:00 | 1 | 1 | 1 | {1,2}
4 | 2016-05-28 11:00:20 | 3 | 1 | 1 | {1,2}
5 | 2016-05-28 11:00:25 | 3 | 2 | 1 | {1,2}
6 | 2016-05-28 11:00:05 | 4 | 1 | 3 | {1,2,3}
7 | 2016-05-28 11:00:10 | 4 | 2 | 3 | {1,2,3}
8 | 2016-05-28 11:00:10 | 4 | 3 | 2 | {4,5}
(6 rows)