BigQuery: running count of unique IDs per year
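I've found a bunch of similar questions, but none that addresses this specific problem (correct me if I'm wrong).
I'm trying, in BigQuery, to use analytic functions to index every row of a table with a running count of distinct users per year.
So, given: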
with dataset as (
select 'A' as user, '2020' as year, RAND() as some_value
union all
select 'A' as user, '2020' as year, RAND() as some_value
union all
select 'B' as user, '2020' as year, RAND() as some_value
union all
select 'B' as user, '2020' as year, RAND() as some_value
union all
select 'B' as user, '2020' as year, RAND() as some_value
union all
select 'C' as user, '2020' as year, RAND() as some_value
union all
select 'C' as user, '2020' as year, RAND() as some_value
union all
select 'A' as user, '2021' as year, RAND() as some_value
union all
select 'A' as user, '2021' as year, RAND() as some_value
union all
select 'B' as user, '2021' as year, RAND() as some_value
union all
select 'C' as user, '2021' as year, RAND() as some_value
union all
select 'C' as user, '2021' as year, RAND() as some_value
union all
select 'C' as user, '2021' as year, RAND() as some_value
union all
select 'C' as user, '2021' as year, RAND() as some_value
union all
select 'C' as user, '2021' as year, RAND() as some_value
)
I would like to get:
rcount | user | year | some_value
1 | A | 2020 | 0.2365421124968884
1 | A | 2020 | 0.21087749308191206
2 | B | 2020 | 0.6096882013526258
2 | B | 2020 | 0.8544447727632739
2 | B | 2020 | 0.6113604025541309
3 | C | 2020 | 0.5803237472480643
3 | C | 2020 | 0.165305669127888
1 | A | 2021 | 0.1200575362708826
1 | A | 2021 | 0.015721175944171915
2 | B | 2021 | 0.21890252010457295
3 | C | 2021 | 0.5087613385277634
3 | C | 2021 | 0.9949262690813603
3 | C | 2021 | 0.50824183164116
3 | C | 2021 | 0.8262428736484341
3 | C | 2021 | 0.6866964737106948
I have tried:
count(user) over (partition by year,user )
I have also tried window specifications such as order by year range between unbounded preceding and current row, as well as row_count().
At this point I don't know where to look for a solution.
Try the following:
select user
, year
, some_value
, sum(count) over (partition by year order by year, user ROWS UNBOUNDED PRECEDING) as rcount
from (
select user
, year
, some_value
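-- partition by year so each year's first row is always flagged, even if it repeats the previous year's last user
, IF(lag(user,1) OVER (partition by year order by user)=user,0,1) count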
from dataset
)
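The inner sub-select decides, based on the previous row, whether each record should be counted (1 when the user changes, 0 otherwise); the outer select then simply takes a running sum of those flags.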
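To make the flag-and-sum logic easier to follow, here is a minimal sketch of the same idea in plain Python rather than BigQuery SQL. The `rows` list and the `running_unique_count` helper are hypothetical names introduced only for illustration; the sample data mirrors the question's dataset without `some_value`, and the comparison with the previous row is done inside each year so the count restarts at 1.

```python
from itertools import accumulate, groupby

# (user, year) pairs mirroring the sample dataset; some_value is omitted.
rows = (
    [("A", "2020")] * 2 + [("B", "2020")] * 3 + [("C", "2020")] * 2
    + [("A", "2021")] * 2 + [("B", "2021")] + [("C", "2021")] * 5
)

def running_unique_count(rows):
    """Return (rcount, user, year) tuples, restarting the count each year."""
    result = []
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))  # order by year, user
    for year, group in groupby(ordered, key=lambda r: r[1]):  # partition by year
        group = list(group)
        # Flag is 1 when the user differs from the previous row in the partition
        # (the first row has no previous row, so it is always flagged).
        flags = [
            1 if i == 0 or group[i][0] != group[i - 1][0] else 0
            for i in range(len(group))
        ]
        # Running sum of the flags = number of distinct users seen so far.
        for (user, _), rcount in zip(group, accumulate(flags)):
            result.append((rcount, user, year))
    return result

result = running_unique_count(rows)
for rcount, user, year in result:
    print(rcount, user, year)
```

The running sum only works because the rows are sorted by user within each year, so repeated users are adjacent and each new user contributes exactly one flag.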
A simpler solution is to use DENSE_RANK:
SELECT
DENSE_RANK() OVER (PARTITION BY year ORDER BY user) as rcount,
user,
year,
some_value
FROM dataset
DENSE_RANK is documented with BigQuery's numbering functions.
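As a quick local sanity check, the DENSE_RANK query can be tried against SQLite through Python's `sqlite3` module. This is only a sketch under assumptions: it relies on the bundled SQLite supporting window functions (version 3.25 or later), it runs on SQLite rather than BigQuery, and the table setup and the added `ORDER BY` are illustrative choices that are not part of the answer above.

```python
import random
import sqlite3

# Sample (user, year) pairs mirroring the question's dataset.
sample = (
    [("A", "2020")] * 2 + [("B", "2020")] * 3 + [("C", "2020")] * 2
    + [("A", "2021")] * 2 + [("B", "2021")] + [("C", "2021")] * 5
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (user TEXT, year TEXT, some_value REAL)")
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?)",
    [(user, year, random.random()) for user, year in sample],  # stands in for RAND()
)

# Same window expression as the answer; ORDER BY is added for stable output.
query = """
SELECT
  DENSE_RANK() OVER (PARTITION BY year ORDER BY user) AS rcount,
  user,
  year,
  some_value
FROM dataset
ORDER BY year, user
"""
ranked = conn.execute(query).fetchall()
for rcount, user, year, some_value in ranked:
    print(rcount, user, year, some_value)
```

Because DENSE_RANK gives tied rows the same rank and leaves no gaps, every row of a user gets that user's position among the distinct users of the year, which is the requested running count.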