Unique assignment of closest points between two tables
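In my Postgres 9.5 database with PostGIS 2.2.0 installed, I have two tables with geometric data (points). I want to assign points from one table to points from the other table, but I don't want a buildings.gid to be assigned twice. As soon as one buildings.gid is assigned, it should not be assigned to another pvanlagen.buildid.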
Table definitions
buildings:
CREATE TABLE public.buildings (
gid numeric NOT NULL DEFAULT nextval('buildings_gid_seq'::regclass),
osm_id character varying(11),
name character varying(48),
type character varying(16),
geom geometry(MultiPolygon,4326),
centroid geometry(Point,4326),
gembez character varying(50),
gemname character varying(50),
krsbez character varying(50),
krsname character varying(50),
pv boolean,
gr numeric,
capac numeric,
instdate date,
pvid numeric,
dist numeric,
CONSTRAINT buildings_pkey PRIMARY KEY (gid)
);
CREATE INDEX build_centroid_gix
ON public.buildings
USING gist
(st_transform(centroid, 31467));
CREATE INDEX buildings_geom_idx
ON public.buildings
USING gist
(geom);
pvanlagen:
CREATE TABLE public.pvanlagen (
gid integer NOT NULL DEFAULT nextval('pv_bis2010_bayern_wgs84_gid_seq'::regclass),
tso character varying(254),
tso_number numeric(10,0),
system_ope character varying(254),
system_key character varying(254),
location character varying(254),
postal_cod numeric(10,0),
street character varying(254),
capacity numeric,
voltage_le character varying(254),
energy_sou character varying(254),
beginning_ date,
end_operat character varying(254),
id numeric(10,0),
kkz numeric(10,0),
geom geometry(Point,4326),
gembez character varying(50),
gemname character varying(50),
krsbez character varying(50),
krsname character varying(50),
buildid numeric,
dist numeric,
trans boolean,
CONSTRAINT pv_bis2010_bayern_wgs84_pkey PRIMARY KEY (gid),
CONSTRAINT pvanlagen_buildid_fkey FOREIGN KEY (buildid)
REFERENCES public.buildings (gid) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT pvanlagen_buildid_uni UNIQUE (buildid)
);
CREATE INDEX pv_bis2010_bayern_wgs84_geom_idx
ON public.pvanlagen
USING gist
(geom);
Query
My idea was to add a boolean column pv to the buildings table, which is set when a buildings.gid is assigned:
UPDATE pvanlagen
SET buildid=buildings.gid, dist='50'
FROM buildings
WHERE buildid IS NULL
AND buildings.pv is NULL
AND pvanlagen.gemname=buildings.gemname
AND ST_Distance(ST_Transform(pvanlagen.geom,31467)
,ST_Transform(buildings.centroid,31467))<50;
UPDATE buildings
SET pv=true
FROM pvanlagen
WHERE buildings.gid=pvanlagen.buildid;
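I tested this with 50 rows in buildings, but applying it to all rows takes too long. I have 3,200,000 buildings and 260,000 PV.
The gid of the closest building should be assigned. In case of a tie, it does not matter which gid is assigned. If a rule is needed, take the building with the lower gid.
50 meters was meant as a limit. I used ST_Distance() because it returns the minimum distance, which should be within 50 meters. Later I raised the limit several times, until every PV Anlage was assigned.
Buildings and PV are assigned to their respective regions (gemname). This should make the assignment cheaper, since I know the nearest building must be within the same region (gemname).
I tried this query after receiving the feedback below: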
UPDATE pvanlagen p1
SET buildid = buildings.gid
, dist = buildings.dist
FROM (
SELECT DISTINCT ON (b.gid)
p.id, b.gid, b.dist::numeric
FROM (
SELECT id, ST_Transform(geom, 31467)
FROM pvanlagen
WHERE buildid IS NULL -- not assigned yet
) p
, LATERAL (
SELECT b.gid, ST_Distance(ST_Transform(p1.geom, 31467), ST_Transform(b.centroid, 31467)) AS dist
FROM buildings b
LEFT JOIN pvanlagen p1 ON p1.buildid = b.gid
WHERE p1.buildid IS NULL
AND b.gemname = p1.gemname
ORDER BY ST_Transform(p1.geom, 31467) <-> ST_Transform(b.centroid, 31467)
LIMIT 1
) b
ORDER BY b.gid, b.dist, p.id -- tie breaker
) x, buildings
WHERE p1.id = x.id;
But it returns with 0 rows affected in 234 ms execution time.
Where am I going wrong?
Table schema
To enforce your rule, simply declare pvanlagen.buildid UNIQUE:
ALTER TABLE pvanlagen ADD CONSTRAINT pvanlagen_buildid_uni UNIQUE (buildid);
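As your update revealed, buildings.gid is the PK. To also enforce referential integrity, add a FOREIGN KEY constraint on pvanlagen.buildid referencing buildings.gid.
You have implemented both by now. But it would be more efficient to run the big UPDATE below before adding these constraints.
There is a lot more that should be improved in your table definitions. For one, buildings.gid as well as pvanlagen.buildid should be type integer (or possibly bigint if you burn a lot of PK values). numeric is expensive nonsense.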
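A minimal sketch of that type change, assuming every existing gid and buildid is a whole number within the integer range (the default conversion applies an assignment cast, so fractional values would be rounded). The constraint name is the one from the table definition in the question; dropping and re-creating the FK around the change is my assumption to keep both sides type-compatible at every step. Untested against the actual data:

```sql
BEGIN;

-- Drop the FK first so both columns can change type independently.
ALTER TABLE pvanlagen DROP CONSTRAINT pvanlagen_buildid_fkey;

-- numeric -> integer uses the assignment cast; the PK index is rebuilt.
ALTER TABLE buildings ALTER COLUMN gid     TYPE integer;
ALTER TABLE pvanlagen ALTER COLUMN buildid TYPE integer;

-- Re-create the FK with the same name and the default actions.
ALTER TABLE pvanlagen
   ADD CONSTRAINT pvanlagen_buildid_fkey
   FOREIGN KEY (buildid) REFERENCES buildings (gid);

COMMIT;
```

On a table with 3.2 million rows this rewrites the table and its indexes, so it is best done before the big UPDATE.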
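Let's focus on the core problem:
Basic query to find the closest building
The case is not as simple as it may seem. It is a "nearest neighbour" problem with the additional complication of unique assignment.
This query finds the nearest one building for each PV (short for PV Anlage, a row in pvanlagen), where neither of the two is assigned yet: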
SELECT pv_gid, b_gid, dist
FROM (
SELECT gid AS pv_gid, ST_Transform(geom, 31467) AS geom31467
FROM pvanlagen
WHERE buildid IS NULL -- not assigned yet
) p
, LATERAL (
SELECT b.gid AS b_gid
, round(ST_Distance(p.geom31467
, ST_Transform(b.centroid, 31467))::numeric, 2) AS dist -- see below
FROM buildings b
LEFT JOIN pvanlagen p1 ON p1.buildid = b.gid -- also not assigned ...
WHERE p1.buildid IS NULL -- ... yet
-- AND p.gemname = b.gemname -- not needed for performance, see below
ORDER BY p.geom31467 <-> ST_Transform(b.centroid, 31467)
LIMIT 1
) b;
To make this query fast, you need a spatial, functional GiST index on buildings. It makes the query much faster:
CREATE INDEX build_centroid_gix ON buildings USING gist (ST_Transform(centroid, 31467));
Not sure why you don't ...
Further reading:
- http://workshops.boundlessgeo.com/postgis-intro/knn.html
- http://www.postgresonline.com/journal/archives/306-KNN-GIST-with-a-Lateral-twist-Coming-soon-to-a-database-near-you.html
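To check that the planner can use the functional index for the nearest-neighbour ordering, a single-point lookup can be run through EXPLAIN. This is only a sketch: the coordinates are an arbitrary example point (my assumption, roughly in Bavaria), and the ORDER BY expression has to match the indexed expression ST_Transform(centroid, 31467) exactly for the index to qualify:

```sql
EXPLAIN
SELECT b.gid
FROM   buildings b
ORDER  BY ST_Transform(b.centroid, 31467)   -- must match the index expression
      <-> ST_Transform(ST_SetSRID(ST_MakePoint(11.576, 48.137), 4326), 31467)
LIMIT  1;
```

If the index is used, the plan should show an index scan on build_centroid_gix rather than a sequential scan followed by a sort.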
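With the index in place, we don't need to restrict matches to the same gemname for performance. Only do this if it is an actual rule to be enforced. If it has to be observed at all times, include the column in the FK constraint: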
- Restrict foreign key relationship to rows of related subtypes
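A sketch of what including the column in the FK constraint could look like. Both constraint names are hypothetical, and it assumes gemname is spelled identically in both tables. A multicolumn FK needs a matching UNIQUE (or PK) constraint on the referenced columns, so one is added first:

```sql
-- Redundant with the PK on gid alone, but required as FK target.
ALTER TABLE buildings
   ADD CONSTRAINT buildings_gid_gemname_uni UNIQUE (gid, gemname);

-- A PV can now only reference a building with the same gemname.
ALTER TABLE pvanlagen
   ADD CONSTRAINT pvanlagen_buildid_gemname_fkey
   FOREIGN KEY (buildid, gemname) REFERENCES buildings (gid, gemname);
```

With the default MATCH SIMPLE, rows where buildid or gemname is NULL are not checked, so unassigned PV remain valid.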
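Remaining problem
We can use the above query in an UPDATE statement. Each PV is only used once, but more than one PV might still find the same building to be the closest. You only allow one PV per building. So how would you resolve that?
In other words, how would you assign objects here?
Simple solution
One simple solution would be: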
UPDATE pvanlagen p1
SET buildid = sub.b_gid
, dist = sub.dist -- actual distance
FROM (
SELECT DISTINCT ON (b_gid)
pv_gid, b_gid, dist
FROM (
SELECT gid AS pv_gid, ST_Transform(geom, 31467) AS geom31467
FROM pvanlagen
WHERE buildid IS NULL -- not assigned yet
) p
, LATERAL (
SELECT b.gid AS b_gid
, round(ST_Distance(p.geom31467
, ST_Transform(b.centroid, 31467))::numeric, 2) AS dist -- see below
FROM buildings b
LEFT JOIN pvanlagen p1 ON p1.buildid = b.gid -- also not assigned ...
WHERE p1.buildid IS NULL -- ... yet
-- AND p.gemname = b.gemname -- not needed for performance, see below
ORDER BY p.geom31467 <-> ST_Transform(b.centroid, 31467)
LIMIT 1
) b
ORDER BY b_gid, dist, pv_gid -- tie breaker
) sub
WHERE p1.gid = sub.pv_gid;
I use DISTINCT ON (b_gid) to reduce the result to exactly one row per building, picking the PV with the shortest distance. Details:
- Select first row in each GROUP BY group?
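For any building that is closest to more than one PV, only the closest PV is assigned. The PK column gid (alias pv_gid) serves as tiebreaker if two are equally close. In such a case, some PV are dropped from the update and remain unassigned. Repeat the query until all PV are assigned.
This is still a simplistic algorithm, though. Looking at my diagram above, it assigns building 4 to PV 4 and building 5 to PV 5, while 4-5 and 5-4 would probably be a better solution overall ...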
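The "repeat until done" step can be scripted instead of run by hand. The following is a sketch, not a tuned implementation: it wraps the UPDATE from above in a PL/pgSQL DO block and stops once a pass assigns nothing. The variable name _rows and the NOTICE text are my own additions. The whole block runs in a single transaction:

```sql
DO
$$
DECLARE
   _rows bigint;   -- number of PV assigned in the current pass
BEGIN
   LOOP
      UPDATE pvanlagen p1
      SET    buildid = sub.b_gid
           , dist    = sub.dist
      FROM  (
         SELECT DISTINCT ON (b_gid)
                pv_gid, b_gid, dist
         FROM  (
            SELECT gid AS pv_gid, ST_Transform(geom, 31467) AS geom31467
            FROM   pvanlagen
            WHERE  buildid IS NULL                       -- PV not assigned yet
            ) p
             , LATERAL (
            SELECT b.gid AS b_gid
                 , round(ST_Distance(p.geom31467
                       , ST_Transform(b.centroid, 31467))::numeric, 2) AS dist
            FROM   buildings b
            LEFT   JOIN pvanlagen p1 ON p1.buildid = b.gid
            WHERE  p1.buildid IS NULL                    -- building not assigned yet
            ORDER  BY p.geom31467 <-> ST_Transform(b.centroid, 31467)
            LIMIT  1
            ) b
         ORDER  BY b_gid, dist, pv_gid                   -- tie breaker
         ) sub
      WHERE  p1.gid = sub.pv_gid;

      GET DIAGNOSTICS _rows = ROW_COUNT;
      RAISE NOTICE 'assigned % PV in this pass', _rows;
      EXIT WHEN _rows = 0;   -- nothing left to assign (or no free buildings)
   END LOOP;
END
$$;
```

Each pass assigns at least one PV as long as an unassigned PV and an unassigned building exist, so the loop terminates. It inherits the limitation described above: the result is a greedy assignment, not an overall optimum.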
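Aside: type for the dist column
Currently you use numeric for it. Your original query assigned a constant integer, which makes no sense in a numeric column.
In my new query ST_Distance() returns the actual distance in meters as double precision. If we simply assign that, we get 15 or so fractional digits in the numeric data type, and the number is not that exact to begin with. I seriously doubt you want to waste the storage.
I would rather save the original double precision from the calculation. Or, better yet, round as needed. If meters are exact enough, just cast to and save an integer (the cast rounds the number automatically). Or multiply by 100 first to save cm: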
(ST_Distance(...) * 100)::int
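To make the options concrete, here is a small stand-alone comparison. The value 12.3456 is an arbitrary stand-in for a distance returned by ST_Distance(), and the column aliases are only illustrative:

```sql
SELECT d                     AS dist_double    -- original double precision
     , d::int                AS dist_m         -- whole meters, rounded by the cast
     , (d * 100)::int        AS dist_cm        -- centimeters as integer
     , round(d::numeric, 2)  AS dist_rounded   -- numeric with 2 fractional digits
FROM  (SELECT float8 '12.3456' AS d) t;
```

For this input the three converted columns come out as 12, 1235 and 12.35. The integer variants are the cheapest to store. The rounded numeric is what the queries above write into dist.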