How to find aggregate values for rows in a table across different value ranges in BigQuery?
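I have a BigQuery table with the columns company_id, date, and sales_amount. sales_amount is a FLOAT64 column whose values can range from 0 to 1 billion. For each company_id, I need to find the first date on which it hit each specific sales_amount range.
So far, I have written one WITH clause per range, for example: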
WITH A AS (
  SELECT company_id, MIN(date) AS breakDate
  FROM <table>
  WHERE sales_amount >= 100000 AND sales_amount < 500000
  GROUP BY company_id
),
B AS (
  SELECT company_id, MIN(date) AS breakDate
  FROM <table>
  WHERE sales_amount >= 500000 AND sales_amount < 1000000
  GROUP BY company_id
),
AllUnion AS (
  -- Keep the lower-range event only if no higher-range event happened first.
  -- Only A's columns are selected so both sides of the UNION ALL line up.
  SELECT A.company_id, A.breakDate FROM A
  LEFT JOIN B
  USING(company_id)
  WHERE B.breakDate > A.breakDate OR B.company_id IS NULL
  UNION ALL
  SELECT * FROM B
)
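So whenever a new range is added, I have to add a new WITH section and merge all the break events in one big union at the end. While merging, I make sure that if a higher-range event happened first, the lower-range events are filtered out. For example, suppose a company's sales first exceed 500K in January, and then drop to 120K in February. Only the 500K event will be returned; the February event will be filtered out.
I have to do this for different tables, and there may be more events, so I would like to know whether there is a smarter way to write this query in BigQuery.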
You can bucket the rows so that all sales_amount values within the same range share the same bucket ID. Then, by grouping by company_id and bucket_id, you get MIN(date) for each bucket.
SELECT company_id, MIN(date) AS breakDate
FROM <table>
WHERE sales_amount >= 100000
GROUP BY company_id, RANGE_BUCKET(sales_amount, [100000, 500000, 1000000]);
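To see how RANGE_BUCKET assigns bucket numbers around the boundaries, here is a minimal sketch. The sample amounts and the aliases amount and bucket_no are made up for illustration; the boundaries are the ones used in the query above.

```sql
-- Illustrative only: the sample amounts are hypothetical.
SELECT
  amount,
  RANGE_BUCKET(amount, [100000, 500000, 1000000]) AS bucket_no
FROM UNNEST([99999, 100000, 499999, 500000, 1100000]) AS amount
ORDER BY amount;
```

RANGE_BUCKET returns 0 for a value below the first boundary, which is why the query filters on sales_amount >= 100000. A value equal to a boundary falls into the bucket that starts at that boundary, so 100000 and 499999 share bucket 1, 500000 lands in bucket 2, and 1100000 in bucket 3.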
Example:
WITH sales AS (
  SELECT 'c1' AS company_id, '2022-05-01' AS date, 99999 AS sales_amount
  UNION ALL
  SELECT 'c1' AS company_id, '2022-05-02' AS date, 100000 AS sales_amount
  UNION ALL
  SELECT 'c1' AS company_id, '2022-05-03' AS date, 499999 AS sales_amount
  UNION ALL
  SELECT 'c1' AS company_id, '2022-05-04' AS date, 500000 AS sales_amount
  UNION ALL
  SELECT 'c1' AS company_id, '2022-05-05' AS date, 1100000 AS sales_amount
)
SELECT company_id,
       buckets[SAFE_OFFSET(RANGE_BUCKET(sales_amount, buckets) - 1)] AS bucket_id,
       MIN(sales_amount) AS sales_amount,
       MIN(date) AS breakDate
FROM sales, UNNEST([STRUCT([100000, 500000, 1000000] AS buckets)])
WHERE sales_amount >= 100000
GROUP BY company_id, bucket_id
;
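Output (row order may vary because the query has no ORDER BY):
company_id | bucket_id | sales_amount | breakDate
c1 | 100000 | 100000 | 2022-05-02
c1 | 500000 | 500000 | 2022-05-04
c1 | 1000000 | 1100000 | 2022-05-05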
Example 2:
WITH sales AS (
  SELECT 'c1' AS company_id, '2022-05-01' AS date, 99999 AS sales_amount
  UNION ALL
  SELECT 'c1' AS company_id, '2022-05-02' AS date, 100000 AS sales_amount
  UNION ALL
  SELECT 'c1' AS company_id, '2022-05-03' AS date, 499999 AS sales_amount
  UNION ALL
  SELECT 'c1' AS company_id, '2022-05-05' AS date, 1100000 AS sales_amount
  UNION ALL
  SELECT 'c1' AS company_id, '2022-05-07' AS date, 500000 AS sales_amount
),
bucketized_sales AS (
  SELECT company_id,
         buckets[SAFE_OFFSET(RANGE_BUCKET(sales_amount, buckets) - 1)] AS bucket_id,
         MIN(sales_amount) AS sales_amount,
         MIN(date) AS breakDate
  FROM sales, UNNEST([STRUCT([100000, 500000, 1000000] AS buckets)])
  WHERE sales_amount >= 100000
  GROUP BY company_id, bucket_id
)
SELECT *
FROM bucketized_sales
WHERE TRUE
QUALIFY breakDate <= FIRST_VALUE(breakDate) OVER (PARTITION BY company_id ORDER BY bucket_id DESC)
ORDER BY breakDate
;
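Output:
company_id | bucket_id | sales_amount | breakDate
c1 | 100000 | 100000 | 2022-05-02
c1 | 1000000 | 1100000 | 2022-05-05
The 500000 bucket (first reached on 2022-05-07) is filtered out because the 1000000 bucket had already been reached on 2022-05-05.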
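Note that the QUALIFY in Example 2 compares each bucket only with the first date of the highest bucket a company reached. If a lower bucket should also be dropped whenever any higher bucket (not just the top one) was reached earlier, one possible variant is to compare against the earliest breakDate among the current and all higher buckets. This is a sketch under that assumption, with made-up sample rows, not a tested drop-in replacement:

```sql
-- Hypothetical sample data: the 500000 range is hit before the 100000 range.
WITH sales AS (
  SELECT 'c1' AS company_id, '2022-05-01' AS date, 500000 AS sales_amount
  UNION ALL
  SELECT 'c1', '2022-05-02', 100000
  UNION ALL
  SELECT 'c1', '2022-05-03', 1100000
),
bucketized_sales AS (
  SELECT company_id,
         buckets[SAFE_OFFSET(RANGE_BUCKET(sales_amount, buckets) - 1)] AS bucket_id,
         MIN(date) AS breakDate
  FROM sales, UNNEST([STRUCT([100000, 500000, 1000000] AS buckets)])
  WHERE sales_amount >= 100000
  GROUP BY company_id, bucket_id
)
SELECT *
FROM bucketized_sales
WHERE TRUE
-- Keep a bucket only if no higher bucket has an earlier first date.
QUALIFY breakDate <= MIN(breakDate) OVER (
  PARTITION BY company_id
  ORDER BY bucket_id
  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
)
ORDER BY breakDate;
```

With this sample data, the 100000 bucket (first hit on 2022-05-02) is dropped because the 500000 bucket was already reached on 2022-05-01, whereas the FIRST_VALUE version would keep it. To add a new range, only the boundaries array needs to change.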