Moving Distinct Count in BigQuery (SQL syntax)

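For all the SQL gurus out there, I have an odd one. I need to get a distinct count of items over a 14-day moving window. I tried dense_rank, but it does not let me specify (or I do not know how to specify) a 14-day moving window.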

For simplicity, my dataset has 3 columns.

  1. Store (string)
  2. Item code (string)
  3. Date (date)

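A simple example of what I am ultimately after is as follows:

So on day 1 my unique count is 4, on day 2 it is 5, and on day 3 it is 6 (1,2,3,4,5,6).

Once I reach day 15, I would ignore the values found on day 1 and only need days 2-15.

Any help would be greatly appreciated.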

Consider the approach below

select store, date, 
  ( select count(distinct item) 
    from t.items item
  ) distinct_items_count
from (
  select store, date, any_value(items) items
  from (
    select store, date, 
      array_agg(item_code) over(partition by store order by unix_date(date) range between 13 preceding and current row) items
    from your_table
  )
  group by store, date
) t
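To try the query above without a real table, here is a minimal sketch that feeds it a few inline rows. The store name, item codes and dates are made up purely for illustration. They are chosen so that item '1' appears only on the first day and should drop out of the window 14 days later.

```sql
-- Hypothetical sample data standing in for your_table (values are invented).
with your_table as (
  select 'A' as store, '1' as item_code, date '2024-01-01' as date union all
  select 'A', '2', date '2024-01-01' union all
  select 'A', '2', date '2024-01-02' union all
  select 'A', '3', date '2024-01-02' union all
  select 'A', '3', date '2024-01-15'
)
select store, date,
  ( select count(distinct item)   -- de-duplicate the items collected for this row's window
    from t.items item
  ) distinct_items_count
from (
  -- all rows of the same store and date share the same window, so keep one array per day
  select store, date, any_value(items) items
  from (
    select store, date,
      -- 13 preceding days plus the current day = 14-day window
      array_agg(item_code) over(partition by store order by unix_date(date) range between 13 preceding and current row) items
    from your_table
  )
  group by store, date
) t
order by store, date
```

With this sample, 2024-01-01 should give 2 (items 1, 2), 2024-01-02 should give 3 (items 1, 2, 3), and 2024-01-15 should give 2 (items 2, 3), because its window runs from 2024-01-02 to 2024-01-15 and no longer includes the only day on which item 1 occurred.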

Another option to consider is to use the HyperLogLog++ functions, which consume fewer resources and run faster

select store, date, 
  ( select hll_count.merge(sketch)
    from t.sketches_14days sketch 
  ) distinct_items_count
from (
  select store, date, 
    array_agg(daily_sketch) over(partition by store order by unix_date(date) range between 13 preceding and current row) sketches_14days
  from (
    select store, date, hll_count.init(item_code) daily_sketch
    from your_table
    group by store, date
  )
) t      

Note:

HLL++ functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical uncertainty. This makes HLL++ functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.
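
If the default accuracy turns out to be too coarse for your data, one knob to try is the optional second argument of hll_count.init, which sets the sketch precision. As far as I know, the documented range is 10 to 24 with a default of 15, and higher values trade memory for accuracy. The sketch below only changes the innermost subquery of the HLL++ query. The value 20 is an arbitrary choice for illustration, not a recommendation.

```sql
-- Daily sketches built with an explicit precision (second argument).
-- 20 is an illustrative value only; check the current BigQuery docs for the allowed range.
select store, date, hll_count.init(item_code, 20) daily_sketch
from your_table
group by store, date
```

The outer array_agg and hll_count.merge steps stay the same. Comparing the result against the exact count(distinct ...) version on a small date range is an easy way to see how much error you are accepting.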