BigQuery: resources exceeded when computing running sum
The following query fails with a 'resources exceeded' error:
SELECT TIME, VALUE, SUM(TAG) OVER (ORDER BY TIME ASC) AS RUNNING
FROM `grid-frequency.frequency.tagged_excursions`
The error message points to the OVER clause as the culprit. However, if I only sort the table by time without computing the running total, for example
SELECT TIME, VALUE
FROM `grid-frequency.frequency.tagged_excursions` ORDER BY TIME ASC LIMIT 1000
it works fine. Why is the former so much more expensive than the latter, and how can I compute it more efficiently?
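Actually, your second query does not need to sort the entire table. Because of the LIMIT 1000, it is closer to a top-N sort problem, which I guess is much cheaper than the first query. The first query has no PARTITION BY, so the window function has to order every row of the table as a single partition.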
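The way I have solved a similar problem before is to break the problem space into smaller pieces, solve them one by one, and then combine the partial results to get the same result as the original query.
Below is a simplified version of a query I used before. It splits the table into smaller parts using the month of each date and first computes the cumulative monthly sum (the total of all earlier months). Adding this cumulative monthly sum to the net cumulative daily sum (the running sum within each month) gives the same result as a running sum over the whole table.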
I think this approach reduces the cost of the sort and helps avoid the 'resources exceeded' error.
Hope this helps.
DECLARE purchase_log ARRAY<STRUCT<
dt STRING,
order_id INT64,
user_id STRING,
purchase_amount INT64
>>
DEFAULT [
('2014-01-01', 1, 'rhwpvvitou', 13900),
('2014-01-02', 4, 'wkmqqwbyai', 14893),
('2014-01-03', 5, 'ciecbedwbq', 13054),
('2014-02-03', 7, 'dfgqftdocu', 15591),
('2014-02-04', 8, 'sbgqlzkvyn', 3025),
('2014-02-05', 11, 'jqcmmguhik', 4235),
('2014-03-05', 13, 'pgeojzoshx', 16008),
('2014-03-06', 16, 'gbchhkcotf', 3966),
('2014-03-07', 17, 'zfmbpvpzvu', 28159),
('2014-04-07', 19, 'uyqboqfgex', 10805),
('2014-04-08', 21, 'zosbvlylpv', 13999),
('2014-05-08', 22, 'bwfbchzgnl', 2299),
('2014-05-09', 23, 'zzgauelgrt', 16475),
('2014-05-09', 24, 'qrzfcwecge', 6469),
('2014-05-10', 26, 'cyxfgumkst', 11339)
];
WITH sales AS (
SELECT p.*,
-- divide the problem space into smaller ones
-- (month alone works here because the sample data covers a single year)
EXTRACT(MONTH FROM DATE(dt)) AS month,
SUM(purchase_amount) OVER (PARTITION BY EXTRACT(MONTH FROM DATE(dt)) ORDER BY dt) AS net_cumulative_sales,
FROM UNNEST(purchase_log) p
),
monthly_cumulative_sales AS (
SELECT month,
IFNULL(SUM(SUM(purchase_amount)) OVER w, 0) AS cumulative_monthly_sales
FROM sales GROUP BY 1
WINDOW w AS (ORDER BY month RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
)
SELECT dt, purchase_amount,
net_cumulative_sales + cumulative_monthly_sales AS cumulative_sales,
-- below column is for validation, should be same as `cumulative_sales`
SUM(purchase_amount) OVER (ORDER BY dt) AS validation
FROM sales JOIN monthly_cumulative_sales USING (month)
;
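Applied to the table from the question, the same idea could look roughly like the sketch below. I have not run it against your data, and it rests on some assumptions: that TIME is a non-null TIMESTAMP (or DATETIME) column, that TAG is numeric and non-null, and that one day of rows is small enough to sort. The names per_day, day_offsets, running_in_day and offset_before_day are made up for illustration. If a single day is still too large, a finer bucket such as the hour should work the same way.

```sql
-- Sketch only: assumes TIME is a non-null TIMESTAMP/DATETIME and TAG is numeric and non-null.
WITH per_day AS (
  SELECT
    TIME,
    VALUE,
    DATE(TIME) AS day,
    -- running sum that restarts at the beginning of each day
    SUM(TAG) OVER (PARTITION BY DATE(TIME) ORDER BY TIME ASC) AS running_in_day
  FROM `grid-frequency.frequency.tagged_excursions`
),
day_offsets AS (
  SELECT
    day,
    -- total of all earlier days; 0 for the first day
    IFNULL(
      SUM(day_total) OVER (ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
      0
    ) AS offset_before_day
  FROM (
    -- one row per day, so this window only has to order the days, not every row
    SELECT DATE(TIME) AS day, SUM(TAG) AS day_total
    FROM `grid-frequency.frequency.tagged_excursions`
    GROUP BY day
  )
)
SELECT
  TIME,
  VALUE,
  running_in_day + offset_before_day AS RUNNING
FROM per_day
JOIN day_offsets USING (day)
;
```

The only window without a PARTITION BY now runs over one row per day instead of over the whole table, which is where the saving is expected to come from.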