BigQuery: resources exceeded when computing running sum

The following query fails with a 'resources exceeded' error:

SELECT TIME, VALUE, SUM(TAG) OVER (ORDER BY TIME ASC) AS RUNNING 
FROM `grid-frequency.frequency.tagged_excursions` 

The error message points to the OVER clause as the culprit. However, if I just sort the table by time without computing the running total, for example

SELECT TIME, VALUE 
FROM `grid-frequency.frequency.tagged_excursions`  ORDER BY TIME ASC LIMIT 1000

it works fine. Why is the former so much more expensive than the latter, and how can I compute it more efficiently?

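Actually, your second query doesn't need to sort the whole table. Because of the LIMIT 1000, it is closer to a Top-N sort problem, which I'd guess is much cheaper than the first query. The first query, by contrast, has an OVER clause with no PARTITION BY, so every row in the table has to be ordered within one single window partition.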

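The way I've tackled similar problems before is to break the problem space into smaller pieces, solve them one by one, and then combine the partial results to get the same answer I wanted.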

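Below is a simplified version of a query I tried before. It splits the table into smaller parts using the month from the date, and computes the cumulative monthly sum (the total of all preceding months) first. Adding this cumulative monthly sum to the net cumulative daily sum (the running sum within the current month) gives the same result as a running sum over the whole table.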

I think this approach reduces the cost of the sort and helps avoid the 'resources exceeded' error.

Hope this helps.

DECLARE purchase_log ARRAY<STRUCT<
    dt STRING,
    order_id INT64,
    user_id STRING,
    purchase_amount INT64
>>
DEFAULT [
  ('2014-01-01',  1, 'rhwpvvitou', 13900),
  ('2014-01-02',  4, 'wkmqqwbyai', 14893),
  ('2014-01-03',  5, 'ciecbedwbq', 13054),
  ('2014-02-03',  7, 'dfgqftdocu', 15591),
  ('2014-02-04',  8, 'sbgqlzkvyn',  3025),
  ('2014-02-05', 11, 'jqcmmguhik',  4235),
  ('2014-03-05', 13, 'pgeojzoshx', 16008),
  ('2014-03-06', 16, 'gbchhkcotf',  3966),
  ('2014-03-07', 17, 'zfmbpvpzvu', 28159),
  ('2014-04-07', 19, 'uyqboqfgex', 10805),
  ('2014-04-08', 21, 'zosbvlylpv', 13999),
  ('2014-05-08', 22, 'bwfbchzgnl',  2299),
  ('2014-05-09', 23, 'zzgauelgrt', 16475),
  ('2014-05-09', 24, 'qrzfcwecge',  6469),
  ('2014-05-10', 26, 'cyxfgumkst', 11339)
];


WITH sales AS (
  SELECT p.*,
         -- divide the problem space into smaller ones  
         EXTRACT(MONTH FROM DATE(dt)) AS month,
         SUM(purchase_amount) OVER (PARTITION BY EXTRACT(MONTH FROM DATE(dt)) ORDER BY dt) AS net_cumulative_sales,
    FROM UNNEST(purchase_log) p
),
monthly_cumulative_sales AS (
  SELECT month,
         IFNULL(SUM(SUM(purchase_amount)) OVER w, 0) AS cumulative_monthly_sales 
    FROM sales GROUP BY 1
  WINDOW w AS (ORDER BY month RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
)
SELECT dt, purchase_amount,
       net_cumulative_sales + cumulative_monthly_sales AS cumulative_sales,
       -- below column is for validation, should be same as `cumulative_sales`
       SUM(purchase_amount) OVER (ORDER BY dt) AS validation
  FROM sales JOIN monthly_cumulative_sales USING (month)
;
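
Applied to the table from the question, the same idea might look roughly like the sketch below. It is untested against the real table and rests on a few assumptions: that TIME is a TIMESTAMP column, that TAG is numeric, and that one calendar day is a small enough chunk to sort on its own. The names day_key, running_in_day and sum_before_day are made up for illustration.

```sql
-- Sketch only: assumes TIME is a TIMESTAMP and TAG is numeric.
WITH tagged AS (
  SELECT TIME, VALUE, TAG,
         -- hypothetical chunk key; choose whatever granularity fits the data
         DATE(TIME) AS day_key,
         -- running sum inside each day, so no single partition holds the whole table
         SUM(TAG) OVER (PARTITION BY DATE(TIME) ORDER BY TIME ASC) AS running_in_day
    FROM `grid-frequency.frequency.tagged_excursions`
),
daily_offsets AS (
  SELECT day_key,
         -- total of all earlier days; 0 for the first day
         IFNULL(SUM(SUM(TAG)) OVER (ORDER BY day_key
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS sum_before_day
    FROM tagged
   GROUP BY day_key
)
SELECT TIME, VALUE,
       running_in_day + sum_before_day AS RUNNING
  FROM tagged JOIN daily_offsets USING (day_key)
;
```

The window in daily_offsets still has no PARTITION BY, but it only sees one row per day rather than one row per measurement, so it should be far cheaper than the original window. If a single day is still too large, a finer key such as an hourly TIMESTAMP_TRUNC(TIME, HOUR) could be used in the same way.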