Vertica SQL for running count distinct and running conditional count

I am trying to build a department-level score table from a more granular, product URL-level score table.

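  1. The dates are not contiguous.

  2. Not all URLs get a score update on the same day (they are independent of each other).

  3. dist_url should be a running count distinct (a cumulative count distinct).

  4. Both dist urls and urls score >=30 are distinct counts.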

What I have now is:

   Date  url   Store   Dept   Page   Score   
   10/1   a      US      A      X      10   
   10/1   b      US      A      X      30  
   10/1   c      US      A      X      60
   10/4   a      US      A      X      20  
   10/4   d      US      A      X      60
   10/6   b      US      A      X      22 
   10/9   a      US      A      X      40
   10/9   e      US      A      X      10


What I want is:

   Date  Store   Dept   Page   dist urls   urls score >=30  
   10/1   US      A      X          3          2 
   10/4   US      A      X          4          3
   10/6   US      A      X          4          2
   10/9   US      A      X          5          2

I think dist_url can be done with a window function; I am just not sure how to write the query.

My current query looks like this, but it is wrong because the distinct count needs to be cumulative:

   SELECT
        bm.AnalysisDate,
        su.SoID         AS Store,
        su.DptCaID      AS DTID,
        su.PageTypeID   AS PTID,
        COUNT(DISTINCT bm.SeoURLID) AS NumURLsWithDupScore,
        SUM(CASE WHEN bm.DuplicationScore > 30 THEN 1 ELSE 0 END) AS Over30Count
    FROM csn_seo.tblBotifyMetrics bm 
    INNER JOIN csn_seo.tblSEOURLs su 
        ON bm.SeoURLID = su.ID
    WHERE su.DptCaID IS NOT NULL 
        AND su.DptCaID <> 0    
        AND su.PageTypeID IS NOT NULL
        AND su.PageTypeID <> -1
        AND bm.iscompliant = 1
    GROUP BY bm.AnalysisDate, su.SoID, su.DptCaID, su.PageTypeID;

If anyone has any ideas, please let me know.

Based on your question, you seem to want two levels of logic:

select date, store, dept, page,
       sum(sum(start)) over (partition by store, dept, page order by date) as distinct_urls,
       sum(sum(start_30)) over (partition by store, dept, page order by date) as distinct_urls_30
from (-- one row per url, dated at its first appearance
      (select store, dept, page, url, min(date) as date, 1 as start, 0 as start_30
       from t
       group by store, dept, page, url
      ) union all
      -- one row per url, dated at the first time its score was >= 30
      (select store, dept, page, url, min(date) as date, 0, 1
       from t
       where score >= 30
       group by store, dept, page, url
      )
     ) t
group by date, store, dept, page;
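
To see why a running sum of first appearances amounts to a running distinct count, it may help to look at the inner step on its own. The sketch below supplies the question's sample rows through a CTE and lists the first date on which each URL appears. The CTE name `t`, the year 2019, and the alias `first_seen` are assumptions made for illustration only.

```sql
-- Sample rows from the question; the year 2019 is an assumption.
WITH t(date, url, store, dept, page, score) AS (
          SELECT DATE '2019-10-01', 'a', 'US', 'A', 'X', 10
UNION ALL SELECT DATE '2019-10-01', 'b', 'US', 'A', 'X', 30
UNION ALL SELECT DATE '2019-10-01', 'c', 'US', 'A', 'X', 60
UNION ALL SELECT DATE '2019-10-04', 'a', 'US', 'A', 'X', 20
UNION ALL SELECT DATE '2019-10-04', 'd', 'US', 'A', 'X', 60
UNION ALL SELECT DATE '2019-10-06', 'b', 'US', 'A', 'X', 22
UNION ALL SELECT DATE '2019-10-09', 'a', 'US', 'A', 'X', 40
UNION ALL SELECT DATE '2019-10-09', 'e', 'US', 'A', 'X', 10
)
-- Each url contributes exactly one row, dated at its first appearance.
SELECT url, MIN(date) AS first_seen
FROM t
GROUP BY url
ORDER BY first_seen, url;
-- a, b and c first appear on 2019-10-01, d on 2019-10-04, e on 2019-10-09
```

Three URLs start on 10/1, one on 10/4 and one on 10/9, so the running sum of first appearances is 3, 4, 5. No URL first appears on 10/6 (and none first reaches 30 that day), so the full query returns no row for that date on this sample.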

I don't see how your query relates to your question.

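Try as I might, I don't get your output either.

But I think you can avoid the UNION SELECTs. Does this do what you expect? NULLs are not counted by COUNT(DISTINCT), and here you can combine an aggregate expression with an OLAP (window) one. Vertica also has named windows to improve readability.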

WITH
input(Date,url,Store,Dept,Page,Score) AS (
          SELECT DATE '2019-10-01','a','US','A','X',10
UNION ALL SELECT DATE '2019-10-01','b','US','A','X',30
UNION ALL SELECT DATE '2019-10-01','c','US','A','X',60
UNION ALL SELECT DATE '2019-10-04','a','US','A','X',20
UNION ALL SELECT DATE '2019-10-04','d','US','A','X',60
UNION ALL SELECT DATE '2019-10-06','b','US','A','X',22
UNION ALL SELECT DATE '2019-10-09','a','US','A','X',40
UNION ALL SELECT DATE '2019-10-09','e','US','A','X',10
)
SELECT
  date
, store
, dept
, page
, SUM(COUNT(DISTINCT url)                              ) OVER(w) AS dist_urls
, SUM(COUNT(DISTINCT CASE WHEN score >=30 THEN url END)) OVER(w) AS dist_urls_gt_30
FROM input
GROUP BY
  date
, store
, dept
, page
WINDOW w AS (PARTITION BY store,dept,page ORDER BY date)
;
-- out     date    | store | dept | page | dist_urls | dist_urls_gt_30 
-- out ------------+-------+------+------+-----------+-----------------
-- out  2019-10-01 | US    | A    | X    |         3 |               2
-- out  2019-10-04 | US    | A    | X    |         5 |               3
-- out  2019-10-06 | US    | A    | X    |         6 |               3
-- out  2019-10-09 | US    | A    | X    |         8 |               4
-- out (4 rows)
-- out 
-- out Time: First fetch (4 rows): 45.321 ms. All rows formatted: 45.364 ms
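
The totals above (3, 5, 6, 8) are a running sum of each day's distinct count, so a URL that reappears on a later date is counted again. One possible way to get a true running distinct count is to flag each URL's first row, track when a URL crosses the threshold in either direction, and take running sums of those values. The following is only a sketch: it assumes at most one row per URL per date, and it assumes "urls score >=30" means URLs whose most recent score is at least 30. The names `flagged`, `is_new_url`, `delta_30` and `urls_score_ge_30` are made up for illustration.

```sql
WITH
input(Date,url,Store,Dept,Page,Score) AS (
          SELECT DATE '2019-10-01','a','US','A','X',10
UNION ALL SELECT DATE '2019-10-01','b','US','A','X',30
UNION ALL SELECT DATE '2019-10-01','c','US','A','X',60
UNION ALL SELECT DATE '2019-10-04','a','US','A','X',20
UNION ALL SELECT DATE '2019-10-04','d','US','A','X',60
UNION ALL SELECT DATE '2019-10-06','b','US','A','X',22
UNION ALL SELECT DATE '2019-10-09','a','US','A','X',40
UNION ALL SELECT DATE '2019-10-09','e','US','A','X',10
)
, flagged AS (
  SELECT
    date
  , store
  , dept
  , page
    -- 1 on the first row of each url, 0 on every later row
  , CASE WHEN ROW_NUMBER() OVER(PARTITION BY store,dept,page,url ORDER BY date) = 1
         THEN 1 ELSE 0 END AS is_new_url
    -- +1 when a url moves to >=30, -1 when it drops below, 0 otherwise
  , CASE WHEN score >= 30 THEN 1 ELSE 0 END
    - COALESCE(
        LAG(CASE WHEN score >= 30 THEN 1 ELSE 0 END)
          OVER(PARTITION BY store,dept,page,url ORDER BY date)
      , 0) AS delta_30
  FROM input
)
SELECT
  date
, store
, dept
, page
, SUM(SUM(is_new_url)) OVER(PARTITION BY store,dept,page ORDER BY date) AS dist_urls
, SUM(SUM(delta_30))   OVER(PARTITION BY store,dept,page ORDER BY date) AS urls_score_ge_30
FROM flagged
GROUP BY
  date
, store
, dept
, page
ORDER BY date;
```

On the sample rows this should give dist_urls of 3, 4, 4, 5 and urls_score_ge_30 of 2, 3, 2, 3. The last value differs from the 2 shown in the question's desired table because url a reaches 40 on 10/9, so under the "most recent score" reading three URLs (a, c, d) are at or above 30 on that date.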