Calculating a user's trip distance and duration from GPS logs
I am working with a GPS dataset of people's movements in Beijing. My raw GPS table, trajectories, holds the GPS point sequences of all users:
CREATE TABLE trajectories
(
user_id integer,
session_id bigint NOT NULL,
"timestamp" timestamp with time zone NOT NULL,
lat double precision NOT NULL,
lon double precision NOT NULL,
alt double precision,
CONSTRAINT trajectories_pkey PRIMARY KEY (session_id, "timestamp")
);
SELECT * FROM trajectories ORDER BY user_id, timestamp LIMIT 10;
user_id | session_id | timestamp | lat | lon | alt
---------+----------------+------------------------+-----------+------------+-----
1 | 20081023025304 | 2008-10-23 02:53:04+01 | 39.984702 | 116.318417 | 492
1 | 20081023025304 | 2008-10-23 02:53:10+01 | 39.984683 | 116.31845 | 492
1 | 20081023025304 | 2008-10-23 02:53:15+01 | 39.984686 | 116.318417 | 492
1 | 20081023025304 | 2008-10-23 02:53:20+01 | 39.984688 | 116.318385 | 492
1 | 20081023025304 | 2008-10-23 02:53:25+01 | 39.984655 | 116.318263 | 492
1 | 20081023025304 | 2008-10-23 02:53:30+01 | 39.984611 | 116.318026 | 493
1 | 20081023025304 | 2008-10-23 02:53:35+01 | 39.984608 | 116.317761 | 493
1 | 20081023025304 | 2008-10-23 02:53:40+01 | 39.984563 | 116.317517 | 496
1 | 20081023025304 | 2008-10-23 02:53:45+01 | 39.984539 | 116.317294 | 500
1 | 20081023025304 | 2008-10-23 02:53:50+01 | 39.984606 | 116.317065 | 505
(10 rows)
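The SELECT query above shows the sequence of GPS points for user 1, starting from the beginning of the current trip (session_id = 20081023025304). I want to use the raw data in this table to insert computed trip metrics into a new table, which I defined as: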
CREATE TABLE trip_metrics(
user_id INT,
session_id BIGINT,
lat_start DOUBLE PRECISION,
lat_end DOUBLE PRECISION,
lon_start DOUBLE PRECISION,
lon_end DOUBLE PRECISION,
trip_starttime timestamp,
trip_endtime timestamp,
trip_duration DOUBLE PRECISION,
trip_distance DOUBLE PRECISION,
PRIMARY KEY (user_id, session_id, trip_starttime)
);
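The point of the trip_metrics table is to store the results of the analysis: lat_start and lon_start take the lat and lon values of the starting position (in the example above: 39.984702, 116.318417), trip_starttime takes the starting timestamp (here 2008-10-23 02:53:04+01), and lat_end, lon_end and trip_endtime are filled in the same way from the last point of the trip.
Finally, lat_start/lat_end and lon_start/lon_end are used to calculate the distance the user covered on this trip. The final result should look like this: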
+---------+----------------+-----------+-----------+------------+------------+------------------------+------------------------+---------------+---------------+
| user_id | session_id | lat_start | lat_end | lon_start | lon_end | trip_starttime | trip_endtime | trip_duration | trip_distance |
+---------+----------------+-----------+-----------+------------+------------+------------------------+------------------------+---------------+---------------+
| 1 | 20081023025304 | 39.984702 | 39.984606 | 116.318417 | 116.317065 | 2008-10-23 02:53:04+01 | 2008-10-23 02:53:50+01 | | |
+---------+----------------+-----------+-----------+------------+------------+------------------------+------------------------+---------------+---------------+
with the values of trip_duration and trip_distance filled in (trip_duration is, of course, trip_endtime - trip_starttime).
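I have spent a few days working out how to do this in a PostgreSQL database, restricted to trips within Beijing, latitude (39.85 - 40.05) and longitude (116.25 - 116.5), because some trips extend beyond the city. I created a db-fiddle containing the GPS points of 2 trips by this user (10 points each).
I would appreciate any guidance on solving this problem so that I can make progress in my current research.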
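EDIT
I came across this function, which calculates distance using the haversine formula. I created the function, but I am not sure how to use it to obtain the trip_distance value.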
CREATE OR REPLACE FUNCTION distance(
lat1 double precision,
lon1 double precision,
lat2 double precision,
lon2 double precision)
RETURNS double precision AS
$BODY$
DECLARE
R integer = 6371e3; -- Meters
rad double precision = 0.01745329252;
φ1 double precision = lat1 * rad;
φ2 double precision = lat2 * rad;
Δφ double precision = (lat2-lat1) * rad;
Δλ double precision = (lon2-lon1) * rad;
a double precision = sin(Δφ/2) * sin(Δφ/2) + cos(φ1) * cos(φ2) * sin(Δλ/2) * sin(Δλ/2);
c double precision = 2 * atan2(sqrt(a), sqrt(1-a));
BEGIN
RETURN R * c;
END
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
The easiest way to calculate distances is to install the PostGIS extension, which your tags already suggest:
CREATE EXTENSION postgis;
The function ST_Distance is what you are looking for, e.g. (quick & dirty):
WITH j AS (
  -- first and last timestamp of every trip (session)
  SELECT user_id, session_id,
         min("timestamp") AS trip_starttime,
         max("timestamp") AS trip_endtime
  FROM trajectories
  GROUP BY user_id, session_id
)
SELECT
  j.user_id, j.session_id,
  s.lat AS lat_start, s.lon AS lon_start,
  e.lat AS lat_end, e.lon AS lon_end,
  j.trip_starttime,
  j.trip_endtime,
  age(j.trip_endtime, j.trip_starttime),
  -- on geography values ST_Distance returns metres
  ST_Distance(
    ST_MakePoint(s.lon, s.lat)::geography,
    ST_MakePoint(e.lon, e.lat)::geography) AS trip_distance
FROM j
-- row holding the first point of the trip
JOIN trajectories s ON s."timestamp" = j.trip_starttime
  AND s.session_id = j.session_id AND s.user_id = j.user_id
-- row holding the last point of the trip
JOIN trajectories e ON e."timestamp" = j.trip_endtime
  AND e.session_id = j.session_id AND e.user_id = j.user_id;
user_id | session_id | lat_start | lon_start | lat_end | lon_end | trip_starttime | trip_endtime | age | trip_distance
---------+----------------+-----------+-----------+-----------+------------+------------------------+------------------------+----------+------------------
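1 | 20081023025304 | 39.984702 | 16.318417 | 39.984606 | 116.317065 | 2008-10-23 03:53:04+02 | 2008-10-23 03:53:50+02 | 00:00:46 | 8012597.30391588
Note: lon_start reads 16.318417 in this output, which does not match the sample data (116.318417), so the trip_distance of about 8,012 km reflects that input. With the coordinates from the question, the start and end points are only about 116 m apart.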
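To get from this SELECT to the trip_metrics table in the question, the same idea can be wrapped in an INSERT. The following is only a sketch under a few assumptions that are not stated in the question: a trip counts as "within Beijing" only if all of its points fall inside the given latitude/longitude box, trip_duration is stored in seconds, and trip_distance is the straight-line distance in metres between the first and last point.

```sql
-- Sketch: fill trip_metrics from trajectories (PostGIS assumed to be installed).
WITH j AS (
  SELECT user_id, session_id,
         min("timestamp") AS trip_starttime,
         max("timestamp") AS trip_endtime
  FROM trajectories
  GROUP BY user_id, session_id
  -- assumption: keep a trip only if every point lies inside the Beijing box
  HAVING min(lat) >= 39.85  AND max(lat) <= 40.05
     AND min(lon) >= 116.25 AND max(lon) <= 116.5
)
INSERT INTO trip_metrics (user_id, session_id,
                          lat_start, lat_end, lon_start, lon_end,
                          trip_starttime, trip_endtime,
                          trip_duration, trip_distance)
SELECT j.user_id, j.session_id,
       s.lat, e.lat, s.lon, e.lon,
       -- timestamptz is converted to timestamp using the session time zone
       j.trip_starttime, j.trip_endtime,
       -- duration in seconds (assumed unit)
       extract(epoch FROM j.trip_endtime - j.trip_starttime),
       -- straight-line distance in metres between first and last point
       ST_Distance(ST_MakePoint(s.lon, s.lat)::geography,
                   ST_MakePoint(e.lon, e.lat)::geography)
FROM j
JOIN trajectories s ON s.session_id = j.session_id
                   AND s."timestamp" = j.trip_starttime
JOIN trajectories e ON e.session_id = j.session_id
                   AND e."timestamp" = j.trip_endtime;
```

Without PostGIS, the distance(lat1, lon1, lat2, lon2) function from the question could stand in for the ST_Distance call, i.e. distance(s.lat, s.lon, e.lat, e.lon), which also returns metres.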
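Side note: storing longitude and latitude in separate columns is almost always a bad idea. If possible, store them in a geometry or geography column. It may seem unnecessary at first glance, but PostGIS offers plenty of kickass functions for such columns!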
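As a rough illustration of that side note, the points could be converted once and then queried directly. The column name geog and the result alias path_length_m are hypothetical, and the second query measures something different from the start-to-end distance above: it sums the distances between consecutive points, which is one possible reading of "distance travelled".

```sql
-- Sketch: add a geography column (hypothetical name "geog") and fill it from lon/lat.
ALTER TABLE trajectories ADD COLUMN geog geography(Point, 4326);

UPDATE trajectories
SET geog = ST_SetSRID(ST_MakePoint(lon, lat), 4326)::geography;

-- Path length per trip: sum of the distances between consecutive points, in metres.
SELECT user_id, session_id, sum(step) AS path_length_m
FROM (
  SELECT user_id, session_id,
         -- lag() yields NULL for the first point of a trip, so that step is NULL
         -- and is ignored by sum()
         ST_Distance(geog,
                     lag(geog) OVER (PARTITION BY session_id
                                     ORDER BY "timestamp")) AS step
  FROM trajectories
) AS steps
GROUP BY user_id, session_id;
```

For noisy GPS data the summed path length will usually be longer than the straight-line distance, so it is worth deciding which of the two trip_distance is meant to hold.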