PERCENTILE_CONT with GROUP BY in BigQuery
I want to compute the mean and median of column3 for each combination of column1 and column2, that is, grouped by column1 and column2.
The data looks like this:
Table `xx.yy.zz`
column1 column2 column3
A A1 1
A A1 2
A A1 3
B B2 10
B B2 15
B B2 20
...
The desired output would be:
column1 column2 median3 mean3
A A1 2 2
A A2 median mean
A A3 median mean
B B1 median mean
B B2 15 15
C C1 median mean
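I tried the code below. The first part (table1) works correctly, but neither of my two attempts at the second part (table2) works. What am I doing wrong, and what is the correct way to compute the median of column3 grouped by column1 and column2?
My code so far: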
WITH
table1 AS (SELECT column1, column2,
AVG(column3) AS mean3
FROM xx.yy.zz
GROUP BY 1,2
),
table2 AS (SELECT column1, column2,
PERCENTILE_CONT(column3, 0.5) OVER(PARTITION BY column1, column2,) AS median3
FROM xx.yy.zz
group by 1,2
),
**OR**
table2 AS (SELECT
PERCENTILE_CONT(column3, 0.5) OVER(PARTITION BY column1, column2,) AS median3
FROM xx.yy.zz
),
table3 AS (SELECT * FROM table1
INNER JOIN
(SELECT * FROM table2)
USING(column1, column2)
)
SELECT * FROM table3
Two options; the second is approximate but faster (more scalable):
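-- Option 1: exact median. PERCENTILE_CONT is an analytic function, so compute it
-- per row in a subquery, then collapse each group to one row with MAX.
table2 AS (
  SELECT column1, column2, MAX(median_temp) AS median3
  FROM (
    SELECT column1, column2,
      PERCENTILE_CONT(column3, 0.5) OVER (PARTITION BY column1, column2) AS median_temp
    FROM xx.yy.zz
  )
  GROUP BY 1, 2
),

-- Option 2: approximate median, cheaper on large tables.
table2 AS (
  SELECT column1, column2, APPROX_QUANTILES(column3, 100)[OFFSET(50)] AS median3
  FROM xx.yy.zz
  GROUP BY 1, 2
),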
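As for what goes wrong in the question's attempts, three things stand out. The trailing comma after column2 inside PARTITION BY is a syntax error. In the first attempt, the analytic function is evaluated after GROUP BY, so column3 would have to be grouped or aggregated before PERCENTILE_CONT can reference it. In the second attempt, table2 never selects column1 and column2, so the later USING(column1, column2) has nothing to join on. Below is a minimal sketch of the same three-CTE layout with those points addressed, using SELECT DISTINCT as an alternative to the MAX wrapper. The inline `sample_rows` CTE is a hypothetical stand-in for `xx.yy.zz` that holds only the six rows shown in the question, so the query can be run on its own.

```sql
-- Hypothetical inline stand-in for `xx.yy.zz`: only the six rows shown in the question.
WITH sample_rows AS (
  SELECT * FROM UNNEST([
    STRUCT('A' AS column1, 'A1' AS column2, 1 AS column3),
    ('A', 'A1', 2),
    ('A', 'A1', 3),
    ('B', 'B2', 10),
    ('B', 'B2', 15),
    ('B', 'B2', 20)
  ])
),
table1 AS (
  -- Plain aggregate: one row per (column1, column2).
  SELECT column1, column2, AVG(column3) AS mean3
  FROM sample_rows
  GROUP BY 1, 2
),
table2 AS (
  -- No GROUP BY here: the window function runs per row, and DISTINCT then
  -- collapses the identical rows of each partition into one.
  SELECT DISTINCT
    column1,
    column2,
    PERCENTILE_CONT(column3, 0.5) OVER (PARTITION BY column1, column2) AS median3
  FROM sample_rows
)
SELECT column1, column2, median3, mean3
FROM table1
INNER JOIN table2 USING (column1, column2)
ORDER BY column1, column2;
```

PERCENTILE_CONT interpolates linearly and skips NULLs by default, so for a group with an even number of values it returns the midpoint of the two central ones. To use this against the real table, replace `sample_rows` with `xx.yy.zz` in both CTEs and drop the first CTE.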
The following works in BigQuery Standard SQL:
#standardsql
create temp function median (arr any type) as (
if(mod(array_length(arr), 2) = 0,
( arr[offset(div(array_length(arr), 2) - 1)] +
arr[offset(div(array_length(arr), 2))]) / 2,
arr[offset(div(array_length(arr), 2))] )
);
select column1, column2,
median(array_agg(column3 order by column3)) as median3,
avg(column3) as mean3
from `xx.yy.zz`
group by column1, column2
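Applied to the sample rows shown in your question, this gives a median and mean of 2 for (A, A1) and of 15 for (B, B2), which matches the desired output.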
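To try the array-based median without access to the real table, the function can be run against inline rows. This is a minimal sketch: the `sample_rows` CTE is a hypothetical stand-in for `xx.yy.zz` containing only the six rows from the question, and the function body is repeated unchanged so the script is self-contained.

```sql
-- Same median() function as above, repeated so this script runs on its own.
create temp function median(arr any type) as (
  if(mod(array_length(arr), 2) = 0,
     -- even length: average the two central elements
     (arr[offset(div(array_length(arr), 2) - 1)] +
      arr[offset(div(array_length(arr), 2))]) / 2,
     -- odd length: take the central element
     arr[offset(div(array_length(arr), 2))])
);

-- Hypothetical inline stand-in for `xx.yy.zz`: only the six rows shown in the question.
with sample_rows as (
  select * from unnest([
    struct('A' as column1, 'A1' as column2, 1 as column3),
    ('A', 'A1', 2),
    ('A', 'A1', 3),
    ('B', 'B2', 10),
    ('B', 'B2', 15),
    ('B', 'B2', 20)
  ])
)
select column1, column2,
  -- the array must be sorted for the positional lookup to be a median
  median(array_agg(column3 order by column3)) as median3,
  avg(column3) as mean3
from sample_rows
group by column1, column2
order by column1, column2;
```

One caveat worth checking on real data: array_agg keeps NULLs by default, whereas PERCENTILE_CONT skips them. If column3 can be NULL, writing `array_agg(column3 ignore nulls order by column3)` is probably closer to what is intended.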