Optimize Distance Calculation in BigQuery
I am trying to optimize the following query in BigQuery.
Table1 has ~400K rows
Table2 has 34M rows
I have to map each ID in Table1 to the nearest zip code in Table2.
Both Table1 and Table2 have latitude and longitude data.
WITH
tmp1 AS (
  SELECT ID, latitude, longitude
  FROM `Table1`),
tmp2 AS (
  SELECT CAST(ZipCode AS string) AS ZipCode, lat, lon
  FROM `Table2`)
SELECT AS VALUE
  ARRAY_AGG(
    STRUCT<ID STRING, ZipCode STRING, distance int64>(
      ID, ZipCode, CAST(ST_DISTANCE(tmp1.point, tmp2.point) AS int64))
    ORDER BY ST_DISTANCE(tmp1.point, tmp2.point)
    LIMIT 1)[OFFSET(0)]
FROM (
  SELECT ID, ST_GEOGPOINT(longitude, latitude) point
  FROM tmp1) tmp1
CROSS JOIN (
  SELECT ZipCode, ST_GEOGPOINT(lon, lat) point
  FROM tmp2) tmp2
Any help would be greatly appreciated!
You are missing GROUP BY ID at the very end of your query.
I think that, along with all those CASTs, is what causes the slowness.
Try the version below:
#standardSQL
WITH tmp1 AS (
  SELECT ID, ST_GEOGPOINT(longitude, latitude) point
  FROM `Table1`
), tmp2 AS (
  SELECT CAST(ZipCode AS string) AS ZipCode, ST_GEOGPOINT(lon, lat) point
  FROM `Table2`
)
SELECT AS VALUE ARRAY_AGG(
  STRUCT(ID, ZipCode, distance)
  ORDER BY distance
  LIMIT 1
)[OFFSET(0)]
FROM (
  SELECT ID, ZipCode, ST_DISTANCE(tmp1.point, tmp2.point) AS distance
  FROM tmp1, tmp2
)
GROUP BY ID
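BigQuery can do spatial joins very efficiently, matching items in two tables that are within a specific distance of each other. But you need to know that specific distance, or try a few distances until every point has been matched.
This post discusses it in more detail:
https://medium.com/@mentin/nearest-neighbor-in-bigquery-gis-7d50ebd5d63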
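As a rough sketch of the distance-bounded join described above, here is the earlier query with the cross join replaced by a join on ST_DWITHIN. It reuses the table and column names from the question. The 10000 meter radius is an assumed value, not something taken from the question or the linked post. Any ID with no zip code inside that radius is simply missing from the result and would need another pass with a larger radius.

```sql
#standardSQL
WITH tmp1 AS (
  SELECT ID, ST_GEOGPOINT(longitude, latitude) AS point
  FROM `Table1`
), tmp2 AS (
  SELECT CAST(ZipCode AS STRING) AS ZipCode, ST_GEOGPOINT(lon, lat) AS point
  FROM `Table2`
)
SELECT AS VALUE ARRAY_AGG(
  STRUCT(ID, ZipCode, ST_DISTANCE(tmp1.point, tmp2.point) AS distance)
  ORDER BY ST_DISTANCE(tmp1.point, tmp2.point)
  LIMIT 1
)[OFFSET(0)]
FROM tmp1
JOIN tmp2
  -- ST_DWITHIN as the join predicate is what allows a spatial join
  -- instead of comparing every pair of rows.
  -- 10000 (meters) is an assumed radius; tune it to your data.
  ON ST_DWITHIN(tmp1.point, tmp2.point, 10000)
GROUP BY ID
```

A smaller radius means fewer candidate pairs per ID and a cheaper query, at the cost of leaving more IDs unmatched.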
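You can automate this with BigQuery scripting. Here is one idea, although it discusses a slightly different problem, finding the geometry nearest to a single point:
https://medium.com/@mentin/nearest-neighbor-using-bq-scripting-373241f5b2f5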
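One way the "try a few distances" idea might be scripted is sketched below. This is not the code from the linked post. The temp table names (todo, zips, result), the starting radius, the doubling step, and the upper bound are all assumptions made for illustration. It also assumes ID is a STRING, as the STRUCT type in the question suggests. Whether BigQuery optimizes ST_DWITHIN equally well when the distance is a script variable rather than a literal is something to verify on your own data.

```sql
-- Sketch only: the radii and temp table names are assumptions.
DECLARE radius FLOAT64 DEFAULT 1000;         -- starting radius in meters (assumed)
DECLARE max_radius FLOAT64 DEFAULT 1000000;  -- stop widening beyond this (assumed)
DECLARE unmatched INT64;

-- Points still waiting for a match; starts as all of Table1.
CREATE TEMP TABLE todo AS
SELECT ID, ST_GEOGPOINT(longitude, latitude) AS point
FROM `Table1`;

CREATE TEMP TABLE zips AS
SELECT CAST(ZipCode AS STRING) AS ZipCode, ST_GEOGPOINT(lon, lat) AS point
FROM `Table2`;

CREATE TEMP TABLE result (ID STRING, ZipCode STRING, distance FLOAT64);

LOOP
  -- Nearest zip code within the current radius for every unmatched ID.
  INSERT INTO result (ID, ZipCode, distance)
  SELECT ID, nearest.ZipCode, nearest.distance
  FROM (
    SELECT
      todo.ID AS ID,
      ARRAY_AGG(
        STRUCT(zips.ZipCode AS ZipCode,
               ST_DISTANCE(todo.point, zips.point) AS distance)
        ORDER BY ST_DISTANCE(todo.point, zips.point)
        LIMIT 1
      )[OFFSET(0)] AS nearest
    FROM todo
    JOIN zips
      ON ST_DWITHIN(todo.point, zips.point, radius)
    GROUP BY todo.ID
  );

  -- Remove the IDs that were just matched.
  DELETE FROM todo
  WHERE EXISTS (SELECT 1 FROM result WHERE result.ID = todo.ID);

  SET unmatched = (SELECT COUNT(*) FROM todo);
  IF unmatched = 0 OR radius >= max_radius THEN
    LEAVE;
  END IF;

  SET radius = radius * 2;  -- widen the search for the remaining points
END LOOP;

SELECT * FROM result;
```

Because a match found at a given radius is the nearest zip code among all those within that radius, a point never needs to be revisited once it has been matched. Only the shrinking todo table is joined again at the larger radius.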