Slow query in PostgreSQL geonames db despite indexes
I imported all the tables from http://www.geonames.org/ into my local PostgreSQL 9.5.3.0 database and added indexes to them as follows:
create extension pg_trgm;
CREATE INDEX name_trgm_idx ON geoname USING GIN (name gin_trgm_ops);
CREATE INDEX fcode_trgm_idx ON geoname USING GIN (fcode gin_trgm_ops);
CREATE INDEX fclass_trgm_idx ON geoname USING GIN (fclass gin_trgm_ops);
CREATE INDEX alternatename_trgm_idx ON alternatename USING GIN (alternatename gin_trgm_ops);
CREATE INDEX isolanguage_trgm_idx ON alternatename USING GIN (isolanguage gin_trgm_ops);
CREATE INDEX alt_geoname_id_idx ON alternatename (geonameid);
Now I want to query country names in different languages and cross-reference the geonames attributes with these alternate names, like this:
select g.geonameid as geonameid ,a.alternatename as name,g.country as country, g.fcode as fcode
from geoname g,alternatename a
where
a.isolanguage=LOWER('de')
and a.alternatename ilike '%Sa%'
and (a.ishistoric = FALSE OR a.ishistoric IS NULL)
and (a.isshortname = TRUE OR a.isshortname IS NULL)
and a.geonameid = g.geonameid
and g.fclass='A'
and g.fcode ='PCLI';
Unfortunately, this query takes 13 to 15 seconds on an eight-core machine with a fast SSD. 'Explain analyze verbose' shows:
Nested Loop (cost=0.43..237138.04 rows=1 width=25) (actual time=1408.443..10878.115 rows=15 loops=1)
Output: g.geonameid, a.alternatename, g.country, g.fcode
-> Seq Scan on public.alternatename a (cost=0.00..233077.17 rows=481 width=18) (actual time=0.750..10862.089 rows=2179 loops=1)
Output: a.alternatenameid, a.geonameid, a.isolanguage, a.alternatename, a.ispreferredname, a.isshortname, a.iscolloquial, a.ishistoric
Filter: (((a.alternatename)::text ~~* '%Sa%'::text) AND ((a.isolanguage)::text = 'de'::text))
Rows Removed by Filter: 10675099
-> Index Scan using pk_geonameid on public.geoname g (cost=0.43..8.43 rows=1 width=11) (actual time=0.006..0.006 rows=0 loops=2179)
Output: g.geonameid, g.name, g.asciiname, g.alternatenames, g.latitude, g.longitude, g.fclass, g.fcode, g.country, g.cc2, g.admin1, g.admin2, g.admin3, g.admin4, g.population, g.elevation, g.gtopo30, g.timezone, g.moddate
Index Cond: (g.geonameid = a.geonameid)
Filter: ((g.fclass = 'A'::bpchar) AND ((g.fcode)::text = 'PCLI'::text))
Rows Removed by Filter: 1
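To me this seems to indicate that the sequential scan is somehow performed on 481 rows (which I consider rather few), yet it still takes this long. I can't make sense of it at the moment. Any ideas?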
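Trigrams only work when you search for at least 3 characters: %Sa% will not work, %foo% will. But your indexes are still not good enough. Depending on which parameters are dynamic, use multi-column or filtered indexes: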
CREATE INDEX jkb1 ON geoname(fclass, fcode, geonameid, country);
CREATE INDEX jkb2 ON geoname(geonameid, country) WHERE fclass = 'A' AND fcode = 'PCLI';
Same for the other table:
CREATE INDEX jkb3 ON alternatename(geonameid, alternatename) WHERE (ishistoric = FALSE OR ishistoric IS NULL)
AND (isshortname = TRUE OR isshortname IS NULL) AND isolanguage = LOWER('de');
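As a rough way to check the trigram point above, you could look at the plan for a pattern that has at least three consecutive characters. This is only a sketch: the pattern '%Saar%' is a made-up example, not something from the question, and whether the planner actually picks alternatename_trgm_idx depends on your data and statistics.

```sql
-- Sketch only: assumes pg_trgm and alternatename_trgm_idx from the question exist.
-- '%Saar%' is a hypothetical pattern. It contains the trigrams "saa" and "aar",
-- so the GIN trigram index has something to look up, unlike '%Sa%'.
EXPLAIN (ANALYZE, BUFFERS)
SELECT a.geonameid, a.alternatename
FROM alternatename a
WHERE a.isolanguage = 'de'
  AND a.alternatename ILIKE '%Saar%';
```

If the plan shows a Bitmap Index Scan on alternatename_trgm_idx rather than a Seq Scan on alternatename, the trigram index is being used for the ILIKE condition.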