BigQuery Running Count of Unique ID per Year

Question

I found a bunch of similar questions, but none that address this specific problem (correct me if I'm wrong).

I am trying, on BigQuery, to use analytic functions to index each row of a table with the running count of users per year.

So, given:

with dataset as (
    select 'A' as user, '2020' as year, RAND() as some_value
    union all
    select 'A' as user, '2020' as year, RAND() as some_value
    union all
    select 'B' as user, '2020' as year, RAND() as some_value
    union all
    select 'B' as user, '2020' as year, RAND() as some_value
    union all
    select 'B' as user, '2020' as year, RAND() as some_value
    union all
    select 'C' as user, '2020' as year, RAND() as some_value
    union all
    select 'C' as user, '2020' as year, RAND() as some_value
    union all
    select 'A' as user, '2021' as year, RAND() as some_value
    union all
    select 'A' as user, '2021' as year, RAND() as some_value
    union all
    select 'B' as user, '2021' as year, RAND() as some_value
    union all
    select 'C' as user, '2021' as year, RAND() as some_value
    union all
    select 'C' as user, '2021' as year, RAND() as some_value
    union all
    select 'C' as user, '2021' as year, RAND() as some_value
    union all
    select 'C' as user, '2021' as year, RAND() as some_value
    union all
    select 'C' as user, '2021' as year, RAND() as some_value
)

I would like to get:

rcount  | user  | year | some_value
1       | A     | 2020 | 0.2365421124968884
1       | A     | 2020 | 0.21087749308191206
2       | B     | 2020 | 0.6096882013526258
2       | B     | 2020 | 0.8544447727632739
2       | B     | 2020 | 0.6113604025541309
3       | C     | 2020 | 0.5803237472480643
3       | C     | 2020 | 0.165305669127888
1       | A     | 2021 | 0.1200575362708826
1       | A     | 2021 | 0.015721175944171915
2       | B     | 2021 | 0.21890252010457295
3       | C     | 2021 | 0.5087613385277634
3       | C     | 2021 | 0.9949262690813603
3       | C     | 2021 | 0.50824183164116
3       | C     | 2021 | 0.8262428736484341
3       | C     | 2021 | 0.6866964737106948

I tried:

count(user) over (partition by year,user )

I also tried using a frame such as order by year range between unbounded preceding and current row, as well as row_count(). At this point I don't know where to look for a solution.

Answer 1

Try the following:

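select user
    , year
    , some_value
    -- running total of the flags within each year; the default frame
    -- (RANGE up to the current row) gives all rows of one user the same total
    , sum(count) over (partition by year order by user) as rcount
from (
    select user
        , year
        , some_value
        -- 1 on the first row of each user within a year, 0 on repeats;
        -- partitioning by year restarts the comparison at every year boundary
        , IF(lag(user,1) OVER (partition by year order by user)=user,0,1) count
    from dataset
)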

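The inner sub-select defines whether a row should be counted, based on the content of the previous row: a row is flagged 1 when its user differs from the previous row's user and 0 otherwise. The outer select then simply sums those flags.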
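The flag-then-sum logic can be traced outside BigQuery. The following is only a minimal Python sketch of the idea, not the query itself: the function name `running_unique_count` and the `rows` list are hypothetical, the rows mirror the users and years in the question's dataset, and the sketch assumes the comparison with the previous row restarts at each year.

```python
from itertools import groupby

# Hypothetical sample rows (user, year) mirroring the question's dataset.
rows = (
    [("A", "2020")] * 2 + [("B", "2020")] * 3 + [("C", "2020")] * 2
    + [("A", "2021")] * 2 + [("B", "2021")] * 1 + [("C", "2021")] * 5
)

def running_unique_count(rows):
    """Return (rcount, user, year) tuples ordered by year, then user."""
    result = []
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))
    for year, group in groupby(ordered, key=lambda r: r[1]):
        previous_user = None  # stands in for LAG(user); reset for every year
        running = 0           # stands in for the running SUM of the flags
        for user, _ in group:
            flag = 0 if user == previous_user else 1  # 1 only when the user changes
            running += flag
            result.append((running, user, year))
            previous_user = user
    return result

for rcount, user, year in running_unique_count(rows):
    print(rcount, user, year)
```

Every row of the same user within a year ends up with the same running total, which is the rcount column asked for in the question.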

Answer 2

A simpler solution is to use DENSE_RANK:

SELECT 
  DENSE_RANK() OVER (PARTITION BY year ORDER BY user) as rcount,
  user,
  year,
  some_value
FROM dataset

See the DENSE_RANK documentation for more information.
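DENSE_RANK works here because tied rows share a rank and ranks have no gaps, so a user's rank within a year equals the number of distinct users up to and including that user. One way to check the query without a BigQuery project is to run it against SQLite from Python. This is a sketch under assumptions: SQLite (version 3.25 or later, for window functions) stands in for BigQuery, the table is built from hypothetical rows that mirror the question's dataset, and some_value is a placeholder 0.0 rather than RAND().

```python
import sqlite3

# Hypothetical stand-in for the question's dataset; some_value is a placeholder.
users_by_year = {"2020": "AABBBCC", "2021": "AABCCCCC"}
sample = [(user, year, 0.0) for year, users in users_by_year.items() for user in users]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (user TEXT, year TEXT, some_value REAL)")
conn.executemany("INSERT INTO dataset VALUES (?, ?, ?)", sample)

# Same window expression as in the answer; ORDER BY is added only to make
# the printed output stable.
query = """
SELECT
  DENSE_RANK() OVER (PARTITION BY year ORDER BY user) AS rcount,
  user,
  year,
  some_value
FROM dataset
ORDER BY year, user
"""
ranked = conn.execute(query).fetchall()

for rcount, user, year, _ in ranked:
    print(rcount, user, year)
```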
