How can I enforce random selection of rows from each country/city in PostgreSQL?
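I am working with PostgreSQL in DBeaver. The table has a column addr:country and a column addr:city. It holds roughly 500 million rows, so I need to take a random sample for testing. I planned to randomly select 1% of the data. However, the data may be heavily skewed (large countries have many rows, small countries have few), so I am looking for a fairer way to sample: I want to randomly select one or two rows from every city in every country.
The script I am using is adapted from someone else's query. My script is: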
SELECT osm_id, way, tags, way_centroid, way_area, calc_way_area, area_diff, area_prct_diff, calc_perimeter, calc_count_vertices, building, "building:part", "type", amenity, landuse, tourism, office, leisure, man_made, "addr:flat", "addr:housename", "addr:housenumber", "addr:interpolation", "addr:street", "addr:city", "addr:postcode", "addr:country", length, width, height, osm_uid, osm_user, osm_version
ROW_NUMBER() OVER ( PARTITION BY "addr:country", "addr:city" ) AS "cell_rn",
COUNT(*)
OVER ( PARTITION BY "addr:country", "addr:city") AS "cell_cnt"
FROM osm_qa.buildings
WHERE "addr:city" IS NOT NULL
AND "addr:country" IS NOT NULL
It returns the error message: SQL Error [42601]: ERROR: syntax error at or near "(" Position: 1683.
I am very new to SQL, so the script may contain many mistakes. Is there a way to force the selection of one or two random rows from each addr:city within each addr:country?
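The syntax error comes from the missing comma after osm_version, just before ROW_NUMBER(). To pick the rows at random, you can use the window function dense_rank() to number the records within each partition in random order: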
with base_data as
(
SELECT osm_id, way, tags, way_centroid, way_area, calc_way_area, area_diff, area_prct_diff, calc_perimeter, calc_count_vertices, building, "building:part", "type", amenity, landuse, tourism, office, leisure, man_made, "addr:flat", "addr:housename", "addr:housenumber", "addr:interpolation", "addr:street", "addr:city", "addr:postcode", "addr:country", length, width, height, osm_uid, osm_user, osm_version,
ROW_NUMBER() OVER ( PARTITION BY "addr:country", "addr:city" ) AS "cell_rn",
COUNT(*) OVER ( PARTITION BY "addr:country", "addr:city") AS "cell_cnt",
dense_rank() over (partition by "addr:country", "addr:city" order by random()) as ranking
FROM osm_qa.buildings
WHERE "addr:city" IS NOT NULL
AND "addr:country" IS NOT null
)
select
*
from base_data
where ranking between 1 and 2
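As a variation on the query above, row_number() can be used in place of dense_rank(). With a random ordering the two behave almost identically, but row_number() never assigns the same number twice, so each city yields at most two rows even in the unlikely event of a tie. The following is only a minimal sketch: buildings_demo is a hypothetical toy table standing in for osm_qa.buildings, and it carries just three of the original columns.

```sql
-- Hypothetical toy table standing in for osm_qa.buildings (assumption, not the real schema)
CREATE TEMP TABLE buildings_demo (
    osm_id         bigint,
    "addr:country" text,
    "addr:city"    text
);

INSERT INTO buildings_demo VALUES
    (1, 'DE', 'Berlin'), (2, 'DE', 'Berlin'), (3, 'DE', 'Berlin'),
    (4, 'DE', 'Bonn'),
    (5, 'FR', 'Paris'),  (6, 'FR', 'Paris'),
    (7, 'FR', NULL);      -- dropped by the IS NOT NULL filter below

WITH numbered AS (
    SELECT b.*,
           -- number the rows of each (country, city) group in random order
           row_number() OVER (
               PARTITION BY "addr:country", "addr:city"
               ORDER BY random()
           ) AS rn
    FROM buildings_demo AS b
    WHERE "addr:city" IS NOT NULL
      AND "addr:country" IS NOT NULL
)
SELECT osm_id, "addr:country", "addr:city"
FROM numbered
WHERE rn <= 2;            -- keep at most two random rows per city
```

On this toy data the query keeps two of the three Berlin rows, the single Bonn row, and both Paris rows. Which Berlin rows are kept changes from run to run. On the real table the window function has to sort every row within its partition, so the query may take a while on 500 million rows.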