Moving Distinct Count in BigQuery (SQL syntax)

Question

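To all the SQL gurus out there, I have an odd one. I need to get a distinct count of items over a 14-day moving window. I tried dense_rank, but it doesn't let me specify (or I don't know how to specify) a 14-day moving window.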

For simplicity, my dataset has 3 columns.

Store (string)
Item code (string)
Date (date)

A simple example of what I'm ultimately after:

Day 1 I scan items 1, 2, 3, 4
Day 2 I scan items 2, 3, 4, 5
Day 3 I scan items 1, 6

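So on day 1 my distinct count is 4, on day 2 it is 5 (1,2,3,4,5), and on day 3 it is 6 (1,2,3,4,5,6).

Once I reach day 15, I would ignore the values found on day 1 and only need days 2-15.

Any help would be greatly appreciated.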

Answer 1

Consider the approach below

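select store, date, 
  -- count the distinct item codes collected in the 14-day window
  ( select count(distinct item) 
    from t.items item
  ) distinct_items_count
from (
  -- every row of a given store and date carries the same window array, so keep one
  select store, date, any_value(items) items
  from (
    select store, date, 
      -- unix_date turns the date into a day number, so the range frame covers the current day plus the 13 days before it
      array_agg(item_code) over(partition by store order by unix_date(date) range between 13 preceding and current row) items
    from your_table
  )
  group by store, date
) t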
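
To check the query above against the example from the question, here is a minimal sketch that feeds it the three days of scans. The store name 'A' and the 2024-01-0x dates are made up for illustration; only the item lists come from the question.

```sql
-- Hypothetical sample data: store 'A' and the dates are assumptions for illustration.
with your_table as (
  select 'A' as store, item_code, date '2024-01-01' as date
  from unnest(['1', '2', '3', '4']) as item_code
  union all
  select 'A', item_code, date '2024-01-02'
  from unnest(['2', '3', '4', '5']) as item_code
  union all
  select 'A', item_code, date '2024-01-03'
  from unnest(['1', '6']) as item_code
)
select store, date,
  ( select count(distinct item)
    from t.items item
  ) distinct_items_count
from (
  select store, date, any_value(items) items
  from (
    select store, date,
      array_agg(item_code) over(partition by store order by unix_date(date) range between 13 preceding and current row) items
    from your_table
  )
  group by store, date
) t
order by date
```

This should return 4, 5 and 6 for the three dates, matching the counts in the question. With only three days of data nothing falls out of the 14-day window yet; a row dated 2024-01-15 would no longer see the items scanned on 2024-01-01.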

Answer 2

Another option to consider: use the HyperLogLog++ functions, which consume fewer resources and run faster

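select store, date, 
  -- merge the daily sketches in the window into one approximate distinct count
  ( select hll_count.merge(sketch)
    from t.sketches_14days sketch 
  ) distinct_items_count
from (
  select store, date, 
    -- collect the sketches for the current day plus the 13 days before it
    array_agg(daily_sketch) over(partition by store order by unix_date(date) range between 13 preceding and current row) sketches_14days
  from (
    -- build one HLL++ sketch per store and day
    select store, date, hll_count.init(item_code) daily_sketch
    from your_table
    group by store, date
  )
) t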

Note:

HLL++ functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical uncertainty. This makes HLL++ functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.
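
If the statistical uncertainty is a concern, HLL_COUNT.INIT accepts an optional second argument for the sketch precision (10 to 24, default 15); higher values use more memory per sketch in exchange for a tighter estimate. Below is a sketch of the same query with the precision raised, run on made-up sample data. The store name 'A', the dates, and the choice of 20 are assumptions for illustration only, not a recommendation.

```sql
-- Hypothetical sample data: store 'A' and the dates are assumptions for illustration.
with your_table as (
  select 'A' as store, item_code, date '2024-01-01' as date
  from unnest(['1', '2', '3', '4']) as item_code
  union all
  select 'A', item_code, date '2024-01-02'
  from unnest(['2', '3', '4', '5']) as item_code
  union all
  select 'A', item_code, date '2024-01-03'
  from unnest(['1', '6']) as item_code
),
daily as (
  -- one sketch per store and day; 20 is an arbitrary precision in the allowed 10-24 range
  select store, date, hll_count.init(item_code, 20) daily_sketch
  from your_table
  group by store, date
)
select store, date,
  ( select hll_count.merge(sketch)
    from t.sketches_14days sketch
  ) distinct_items_count
from (
  select store, date,
    array_agg(daily_sketch) over(partition by store order by unix_date(date) range between 13 preceding and current row) sketches_14days
  from daily
) t
order by date
```

For sets this small the estimates are expected to agree with the exact counts from the first answer; the approximation only becomes visible at much larger cardinalities, so it is worth comparing both queries on a sample of real data before relying on the sketches.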

