How to find aggregate values for rows in a table across different value ranges in BigQuery?

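I have a BigQuery table with the columns company_id, date, and sales_amount. sales_amount is a FLOAT64 column whose values can range from 0 to 1 billion. For each company_id, I need to find the first date on which it hit each particular sales_amount range for the first time.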

So far, I have written one WITH clause per range, for example:

With A as (
SELECT company_id, min(date) breakDate
FROM <table>
WHERE sales_amount >= 100000 and sales_amount < 500000
GROUP BY company_id
),
B as (
SELECT company_id, min(date) breakDate
FROM <table>
WHERE sales_amount >= 500000 and sales_amount < 1000000
GROUP BY company_id
),
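AllUnion AS (
-- Keep a lower-range event only if the higher-range event came later or never happened.
-- Columns are listed explicitly so both sides of the UNION ALL have the same shape.
SELECT A.company_id, A.breakDate FROM A
LEFT JOIN B
USING(company_id)
WHERE B.breakDate > A.breakDate OR B.company_id IS NULL

UNION ALL
SELECT company_id, breakDate FROM B
)
SELECT * FROM AllUnion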

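So whenever a new range is added, I have to add a new WITH section and merge all the break events in the big union at the end. When merging, I make sure that if a higher-range event happens first, the lower-range events that follow it are filtered out. For example, suppose a company first exceeds 500K in sales in January, and then its sales drop to 120K in February. Only the 500K event is returned; the February event is filtered out.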

I have to do this for several tables, and there may be more events to track. Is there a smarter way to write this query in BigQuery?

You can bucketize sales_amount so that all amounts falling in the same range share a bucket ID. Then, by grouping on company_id and bucket_id, you get the MIN(date) for each bucket:

SELECT company_id, MIN(date) AS breakDate 
  FROM <table>
 WHERE sales_amount >= 100000 
 GROUP BY company_id, RANGE_BUCKET(sales_amount, [100000, 500000, 1000000]);
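
To make the bucketing rule concrete, here is a minimal Python sketch of how RANGE_BUCKET assigns indices. It assumes a sorted boundary array and non-NULL inputs, in which case the result is the number of boundaries less than or equal to the value: 0 below the first boundary, the array length at or above the last one. The helper names range_bucket and bucket_label are hypothetical and exist only for this illustration; bucket_label mirrors the buckets[SAFE_OFFSET(RANGE_BUCKET(...) - 1)] expression used in the example below, which labels each bucket with its lower bound.

```python
from bisect import bisect_right

def range_bucket(point, boundaries):
    """Emulate BigQuery's RANGE_BUCKET for a sorted, non-NULL boundary list."""
    # bisect_right counts the boundaries that are <= point.
    return bisect_right(boundaries, point)

def bucket_label(point, boundaries):
    """Return the lower bound of the point's bucket, or None below the first boundary."""
    index = range_bucket(point, boundaries)
    return boundaries[index - 1] if index > 0 else None

boundaries = [100000, 500000, 1000000]
for amount in (99999, 100000, 499999, 500000, 1100000):
    print(amount, range_bucket(amount, boundaries), bucket_label(amount, boundaries))
```

Because the boundaries live in a single array, adding a new range means adding one number to that array rather than writing another WITH clause.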

Example:

WITH sales AS (
  SELECT 'c1' AS company_id, '2022-05-01' AS date, 99999 AS sales_amount
   UNION ALL
  SELECT 'c1' AS company_id, '2022-05-02' AS date, 100000 AS sales_amount
   UNION ALL
  SELECT 'c1' AS company_id, '2022-05-03' AS date, 499999 AS sales_amount
   UNION ALL
  SELECT 'c1' AS company_id, '2022-05-04' AS date, 500000 AS sales_amount
   UNION ALL
  SELECT 'c1' AS company_id, '2022-05-05' AS date, 1100000 AS sales_amount
)  
SELECT company_id, 
       buckets[SAFE_OFFSET(RANGE_BUCKET(sales_amount, buckets) - 1)] AS bucket_id,
       MIN(sales_amount) AS sales_amount,
       MIN(date) AS breakDate
  FROM sales, UNNEST([STRUCT([100000, 500000, 1000000] AS buckets)])
 WHERE sales_amount >= 100000 
 GROUP BY company_id, bucket_id
;

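Output (row order may vary, since the query has no ORDER BY):

company_id  bucket_id  sales_amount  breakDate
c1          100000     100000        2022-05-02
c1          500000     500000        2022-05-04
c1          1000000    1100000       2022-05-05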

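Example 2 (the highest bucket is hit before a lower one, so the later lower-bucket event is filtered out):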

WITH sales AS (
  SELECT 'c1' AS company_id, '2022-05-01' AS date, 99999 AS sales_amount
   UNION ALL
  SELECT 'c1' AS company_id, '2022-05-02' AS date, 100000 AS sales_amount
   UNION ALL
  SELECT 'c1' AS company_id, '2022-05-03' AS date, 499999 AS sales_amount
   UNION ALL
  SELECT 'c1' AS company_id, '2022-05-05' AS date, 1100000 AS sales_amount
   UNION ALL
  SELECT 'c1' AS company_id, '2022-05-07' AS date, 500000 AS sales_amount
),
bucketized_sales AS (
  SELECT company_id, 
         buckets[SAFE_OFFSET(RANGE_BUCKET(sales_amount, buckets) - 1)] AS bucket_id,
         MIN(sales_amount) AS sales_amount,
         MIN(date) AS breakDate
    FROM sales, UNNEST([STRUCT([100000, 500000, 1000000] AS buckets)])
   WHERE sales_amount >= 100000 
   GROUP BY company_id, bucket_id
)
SELECT * 
  FROM bucketized_sales 
 WHERE TRUE
QUALIFY breakDate <= FIRST_VALUE(breakDate) OVER (PARTITION BY company_id ORDER BY bucket_id DESC)
 ORDER BY breakDate
;

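Output:

company_id  bucket_id  sales_amount  breakDate
c1          100000     100000        2022-05-02
c1          1000000    1100000       2022-05-05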
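
As a cross-check of the filtering logic in Example 2, the following Python sketch emulates the two steps on the same sample data. It first takes the earliest date per (company_id, bucket), then keeps only the buckets whose first date is not later than the first date of the company's highest bucket, which is the value the QUALIFY clause compares against. The function first_break_dates and its tuple layout are hypothetical, not part of BigQuery. Like the query, it compares each bucket only with the highest bucket the company reached, not with every higher bucket.

```python
from bisect import bisect_right

def first_break_dates(rows, boundaries):
    """rows: iterable of (company_id, iso_date, sales_amount) tuples."""
    earliest = {}  # (company_id, bucket lower bound) -> earliest ISO date
    for company_id, date, amount in rows:
        index = bisect_right(boundaries, amount)  # same index as RANGE_BUCKET
        if index == 0:
            continue  # below the first boundary, like the WHERE filter in the query
        key = (company_id, boundaries[index - 1])
        if key not in earliest or date < earliest[key]:
            earliest[key] = date

    # First date of the highest bucket each company ever reached.
    top = {}
    for (company_id, bucket), date in earliest.items():
        if company_id not in top or bucket > top[company_id][0]:
            top[company_id] = (bucket, date)

    # Keep buckets first hit no later than the highest bucket (the QUALIFY step).
    kept = [
        (company_id, bucket, date)
        for (company_id, bucket), date in earliest.items()
        if date <= top[company_id][1]
    ]
    return sorted(kept, key=lambda row: row[2])  # ORDER BY breakDate

sales = [
    ("c1", "2022-05-01", 99999),
    ("c1", "2022-05-02", 100000),
    ("c1", "2022-05-03", 499999),
    ("c1", "2022-05-05", 1100000),
    ("c1", "2022-05-07", 500000),
]
print(first_break_dates(sales, [100000, 500000, 1000000]))
```

ISO-formatted date strings compare correctly as plain strings, which is why both the sketch and the sample query can use string dates; with a real DATE column the comparison works the same way.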