Optimize Distance Calculation in BigQuery

I am trying to optimize the following query in BigQuery.

  Table1 has ~400K rows
  Table2 has 34M rows

I need to map each ID in Table1 to the nearest zip code in Table2.

Both Table1 and Table2 contain latitude and longitude data.

WITH
tmp1 AS (
  SELECT ID, latitude, longitude
  FROM `Table1`
),
tmp2 AS (
  SELECT CAST(ZipCode AS string) AS ZipCode, lat, lon
  FROM `Table2`
)
SELECT AS VALUE ARRAY_AGG(
  STRUCT<ID STRING, ZipCode STRING, distance int64>(
    ID, ZipCode, CAST(ST_DISTANCE(tmp1.point, tmp2.point) AS int64))
  ORDER BY ST_DISTANCE(tmp1.point, tmp2.point)
  LIMIT 1
)[OFFSET(0)]
FROM (
  SELECT ID, ST_GEOGPOINT(longitude, latitude) point
  FROM tmp1
) tmp1
CROSS JOIN (
  SELECT ZipCode, ST_GEOGPOINT(lon, lat) point
  FROM tmp2
) tmp2

Any help would be greatly appreciated!

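You are missing GROUP BY ID at the very end of your query. Without it, ARRAY_AGG treats the whole cross join as a single group and returns one row instead of one row per ID.
I think that, along with all those CASTs, is what makes the query slow.

Try the version below: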

#standardSQL
WITH tmp1 AS (
  SELECT ID, ST_GEOGPOINT(longitude, latitude) point
  FROM `Table1`
), tmp2 AS (
  SELECT CAST(ZipCode AS string) AS ZipCode, ST_GEOGPOINT(lon, lat) point
  FROM `Table2`
)
SELECT AS VALUE ARRAY_AGG(
  STRUCT(ID, ZipCode, distance)
  ORDER BY distance
  LIMIT 1
)[OFFSET(0)]
FROM (
  SELECT ID, ZipCode, ST_DISTANCE(tmp1.point, tmp2.point) AS distance
  FROM tmp1, tmp2
)
GROUP BY ID    

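BigQuery can do spatial joins very efficiently, matching items from two tables that are within a specific distance of each other. However, you need to know that distance in advance, or try several distances until every point has a match.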

This post discusses it in more detail: https://medium.com/@mentin/nearest-neighbor-in-bigquery-gis-7d50ebd5d63
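
A minimal sketch of such a distance-bounded join, reusing the table and column names from the question. The 10 km radius is an assumption chosen for illustration, not a value from the question; any ID with no zip code inside that radius is dropped from the result and would need a second pass with a larger radius.

```sql
#standardSQL
WITH tmp1 AS (
  SELECT ID, ST_GEOGPOINT(longitude, latitude) AS point
  FROM `Table1`
), tmp2 AS (
  SELECT CAST(ZipCode AS STRING) AS ZipCode, ST_GEOGPOINT(lon, lat) AS point
  FROM `Table2`
)
SELECT AS VALUE ARRAY_AGG(
  STRUCT(ID, ZipCode, ST_DISTANCE(tmp1.point, tmp2.point) AS distance)
  ORDER BY ST_DISTANCE(tmp1.point, tmp2.point)
  LIMIT 1
)[OFFSET(0)]
FROM tmp1
JOIN tmp2
  -- Only pairs within 10,000 meters are compared (assumed radius).
  ON ST_DWITHIN(tmp1.point, tmp2.point, 10000)
GROUP BY ID
```

Because any candidate closer than the one selected would also lie inside the radius, the closest match found within the radius is the true nearest zip code for that ID.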

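You can automate this with BigQuery scripting. Here is one idea, although it addresses a slightly different problem, finding the geometries nearest to a single point: https://medium.com/@mentin/nearest-neighbor-using-bq-scripting-373241f5b2f5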
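
One way the "try several distances" idea might look as a script is sketched below. It is not taken from the linked posts: the temp table names (`todo`, `zips`, `nearest`), the starting radius, the doubling step, and the cap are all hypothetical, and it assumes ID is a STRING as in the question. Whether BigQuery applies its spatial-join optimization when the radius comes from a script variable is also an assumption worth checking in the query plan.

```sql
-- Assumed starting radius and upper cap, in meters; tune both to the data.
DECLARE radius FLOAT64 DEFAULT 1000;
DECLARE max_radius FLOAT64 DEFAULT 1000000;
DECLARE remaining INT64;

-- IDs that still need a match.
CREATE TEMP TABLE todo AS
SELECT ID, ST_GEOGPOINT(longitude, latitude) AS point
FROM `Table1`;

-- Zip code points, built once so the loop does not recompute them.
CREATE TEMP TABLE zips AS
SELECT CAST(ZipCode AS STRING) AS ZipCode, ST_GEOGPOINT(lon, lat) AS point
FROM `Table2`;

-- Matches found so far.
CREATE TEMP TABLE nearest (ID STRING, ZipCode STRING, distance FLOAT64);

SET remaining = (SELECT COUNT(*) FROM todo);

WHILE remaining > 0 AND radius <= max_radius DO
  -- Nearest zip code within the current radius for each unmatched ID.
  INSERT INTO nearest (ID, ZipCode, distance)
  SELECT best.ID, best.ZipCode, best.distance
  FROM (
    SELECT ARRAY_AGG(
      STRUCT(t.ID, z.ZipCode, ST_DISTANCE(t.point, z.point) AS distance)
      ORDER BY ST_DISTANCE(t.point, z.point)
      LIMIT 1
    )[OFFSET(0)] AS best
    FROM todo AS t
    JOIN zips AS z
      ON ST_DWITHIN(t.point, z.point, radius)
    GROUP BY t.ID
  );

  -- Drop the IDs that now have a match, then widen the search.
  DELETE FROM todo WHERE ID IN (SELECT ID FROM nearest);
  SET remaining = (SELECT COUNT(*) FROM todo);
  SET radius = radius * 2;
END WHILE;

SELECT * FROM nearest;
```

Starting small keeps the first, largest join cheap; each later pass only touches the IDs still left in `todo`, so the wider radii are applied to far fewer rows.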