Query times out after 6 hours, how to optimize it?

Question

I have two tables, shapes and squares, that I'm joining based on intersections between GEOGRAPHY columns.

The shapes table contains travel routes for vehicles:

shape_key        STRING            identifier for the shape
shape_lines      ARRAY<GEOGRAPHY>  consecutive line segments making up the shape
shape_geography  GEOGRAPHY         the union of all shape_lines
shape_length_km  FLOAT64           length of the shape in kilometers

Rows: 65k
Size: 718 MB

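We keep shape_lines separated out in an ARRAY because shapes sometimes double back on themselves, and we want to keep those line segments separate rather than have them merged together.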

The squares table contains a grid of 1×1 km squares:

square_key        INT64      identifier of the grid square
square_geography  GEOGRAPHY  four-cornered polygon describing the grid square

Rows: 102k
Size: 15 MB

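The shapes represent travel routes for vehicles. For each shape, we have computed the emissions of harmful substances in a separate table. The aim is to calculate the emissions per grid square, assuming that they are evenly distributed along the route. To that end, we need to know what portion of each route shape intersects with each grid cell.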

Here's the query used to compute that:

SELECT
  shape_key,
  square_key,
  SAFE_DIVIDE(
      (
        SELECT SUM(ST_LENGTH(ST_INTERSECTION(line, square_geography))) / 1000
        FROM UNNEST(shape_lines) AS line
      ),
      shape_length_km)
    AS square_portion
FROM
  shapes,
  squares
WHERE
  ST_INTERSECTS(shape_geography, square_geography)

Sadly, this query times out after 6 hours without producing a useful result.

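In the worst case, the query can produce 6.6 billion rows (65k shapes × 102k squares), but that should not happen in practice. I estimate that each shape typically intersects about 50 grid squares, so the output should be around 65k * 50 = 3.3M rows; nothing that BigQuery shouldn't be able to handle.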
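
One way to sanity-check the "about 50 squares per shape" estimate before running the full join is to count intersections for a handful of shapes only. The following is a minimal sketch, assuming the table and column names from the question; the sample size of 10 is an arbitrary choice for illustration.

```sql
-- Count intersecting grid squares for a small sample of shapes.
-- The sample size (10) is an arbitrary assumption for illustration.
SELECT
  s.shape_key,
  COUNT(*) AS intersecting_squares
FROM
  (
    SELECT shape_key, shape_geography
    FROM shapes
    ORDER BY shape_key
    LIMIT 10
  ) AS s
JOIN
  squares
  ON ST_INTERSECTS(s.shape_geography, square_geography)
GROUP BY
  s.shape_key
ORDER BY
  intersecting_squares DESC
```

If the counts come out near the total number of squares (102k) rather than near 50, the estimate does not hold and the full join really is close to a cartesian product.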

I have considered the geographic join optimizations performed by BigQuery:

Spatial JOINs are joins of two tables with a predicate geographic function in the WHERE clause.

Check. I even rewrote my INNER JOIN to the equivalent "comma" join shown above.

Spatial joins perform better when your geography data is persisted.

Check. Both shape_geography and square_geography come straight out of existing tables.

BigQuery implements optimized spatial JOINs for INNER JOIN and CROSS JOIN operators with the following standard SQL predicate functions: [...] ST_Intersects

Check. Just a single ST_Intersects call, no other conditions.

Spatial joins are not optimized: for LEFT, RIGHT or FULL OUTER joins; in cases involving ANTI joins; when the spatial predicate is negated.

Check. None of these cases apply.

So I think BigQuery should be able to optimize this join using whatever spatial indexing data structures it uses.

I have also considered the advice about cross joins:

Avoid joins that generate more outputs than inputs.

This query definitely generates more outputs than inputs; that is in its nature and cannot be avoided.

When a CROSS JOIN is required, pre-aggregate your data.

To avoid performance issues associated with joins that generate more outputs than inputs:
- Use a GROUP BY clause to pre-aggregate the data.
Check. I already pre-aggregated the emissions data grouped by shape, so each shape in the shapes table is unique.
- Use a window function. Window functions are often more efficient than using a cross join. For more information, see analytic functions.
I don't think a window function can be used for this query.

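I suspect that BigQuery allocates resources based on the number of input rows, not on the size of the intermediate tables or the output. That would explain the pathological behavior I'm seeing.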

How can I make this query run in a reasonable amount of time?

Answer 1

The below definitely does not fit the comment format, so I have to post this as an answer...

I made three adjustments to your query:

- Used JOIN ... ON instead of CROSS JOIN ... WHERE
- Commented out the square_portion calculation
- Used a destination table with the Allow Large Results option
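
The adjusted query itself is not shown in this answer, so the following is only a sketch of what the first two adjustments might look like, using the table and column names from the question. The destination table and the Allow Large Results option are job settings rather than part of the SQL text, so they do not appear here.

```sql
-- Sketch only: the answer's exact query is not shown.
SELECT
  shape_key,
  square_key
  -- square_portion calculation left out to measure the cost of the join alone
FROM
  shapes
JOIN
  squares
  ON ST_INTERSECTS(shape_geography, square_geography)
```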

Even though you expected only 3.3M rows in the output, in reality it is about 6.6B (6,591,549,944) rows, as the result of my experiment showed.

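Note the warning about the billing tier; you are better off using a reservation if one is available.
Obviously, un-commenting the square_portion calculation will increase slot usage, so you might need to revisit your requirements/expectations.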

Answer 2

I think the squares got inverted, resulting in polygons that cover almost the whole Earth:

select st_area(square_geography), * from   `open-transport-data.public.squares`

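This prints results like 5.1E14, which is the area of the whole Earth in square meters. So any line intersects almost all of the squares. See the BigQuery documentation for details: https://cloud.google.com/bigquery/docs/gis-data#polygon_orientation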
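
To see how many squares are affected rather than eyeballing individual rows, a count along these lines may help. It is a minimal sketch: ST_AREA returns square meters, a 1×1 km square should cover roughly 1e6 m², and the 2e6 threshold is an arbitrary assumption chosen to leave some margin.

```sql
-- Count grid squares whose area is implausibly large for a 1×1 km cell.
-- The 2e6 m² threshold is an assumption, not a value from the question.
SELECT
  COUNTIF(ST_AREA(square_geography) > 2e6) AS suspicious_squares,
  COUNT(*) AS total_squares
FROM
  `open-transport-data.public.squares`
```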

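You can invert them by running ST_GeogFromText(wkt, FALSE), which ignores polygon orientation and chooses the smaller of the two possible polygons. With that change the query runs reasonably fast: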

SELECT
  shape_key,
  square_key,
  SAFE_DIVIDE(
      (
        SELECT SUM(ST_LENGTH(ST_INTERSECTION(line, square_geography))) / 1000
        FROM UNNEST(shape_lines) AS line
      ),
      shape_length_km)
    AS square_portion
FROM
  `open-transport-data.public.shapes`,
  (select 
       square_key, 
       st_geogfromtext(st_astext(square_geography), FALSE) as square_geography,
     from `open-transport-data.public.squares`) squares
WHERE
  ST_INTERSECTS(shape_geography, square_geography)
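
Since the documentation quoted in the question says spatial joins perform better when geography data is persisted, it may be worth storing the corrected squares once instead of re-parsing them in every query. This is a sketch under that assumption; `my_dataset.squares_fixed` is a hypothetical table name, not one from the original thread.

```sql
-- `my_dataset.squares_fixed` is a hypothetical destination; substitute your own.
CREATE OR REPLACE TABLE `my_dataset.squares_fixed` AS
SELECT
  square_key,
  -- oriented = FALSE: ignore ring orientation and take the smaller polygon
  ST_GEOGFROMTEXT(ST_ASTEXT(square_geography), FALSE) AS square_geography
FROM
  `open-transport-data.public.squares`
```

The join above could then read from this table directly in place of the inline subquery.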
