How to calculate weekly and monthly appearances of distinct string values in SQL Google Big Query?

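I'm new to SQL, and I have a dataset with a date column and a domain column. The domain column contains only the values 'personal' and 'business'. What I want to do is calculate weekly and monthly rolling counts for each domain type.

My idea was to create 2 separate columns, is_personal and is_business, holding a 1 in rows where domain_type has the matching value. For example, if domain_type is 'personal', the is_personal column gets a 1; otherwise the 1 goes in is_business. Then I would calculate the rolling sum.

However, I'd like to know whether I can avoid creating the extra columns and perform the weekly and monthly rolling counts directly on the string column in Google Big Query.

What I've tried so far is using DATE_TRUNC(CAST(created_at AS date), ISOWEEK) to 'roll up' the dates and group by week. When I try any rolling function on the domain_type column, I get a lot of errors. Some are about functions that Google Big Query doesn't recognize, some are about the fact that I'm working with a string column, and so on.

The end goal is to calculate weekly and monthly rolling counts for the 'business' and 'personal' domain types. Please let me know if there is any other information I can provide that would help. Thanks!

Current data: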

       Date          domain_type

     2017-10-02      personal
     2017-10-03      business
     2017-10-04      personal
     2017-10-05      business
     2017-10-06      personal
     2017-10-07      business
     2017-10-08      personal 
     2017-10-09      business
     2017-10-10      personal
     2017-10-11      business
     2017-10-12      personal
     2017-10-13      business
     2017-10-14      personal
     2017-10-15      business

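Say that in the week of 2017-10-02, a total of 10 users signed up with a personal email address and 20 with a business email address. In the week of 2017-10-09, 25 users signed up with a personal email and 30 with a business email. So over the 2 weeks, the rolling count is 35 for the personal domain type and 50 for the business domain type.

The output I want to achieve: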

Date          domain_type  rolling_count_for_week
2017-10-02    personal           10
2017-10-02    business           20
2017-10-09    personal           35
2017-10-09    business           50

If you want the number of distinct values in a week, use aggregation:

select date_trunc(date, week) as wk, email_type,
       count(*)  -- or count(distinct email) if they are not already unique
from t
group by wk, email_type
order by 1, 2;
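
To make the aggregation concrete with the column names from the question, here is a minimal sketch. The `signups` CTE and its rows are made up for illustration, and `created_at` is assumed to already be a DATE. Note that `WEEK` in `DATE_TRUNC` starts weeks on Sunday, while `ISOWEEK` (used in the question) starts them on Monday, so the sketch uses `ISOWEEK`.

```sql
#standardSQL
-- Hypothetical inline sample standing in for the real table.
WITH signups AS (
  SELECT DATE '2017-10-02' AS created_at, 'personal' AS domain_type UNION ALL
  SELECT DATE '2017-10-03', 'business' UNION ALL
  SELECT DATE '2017-10-04', 'personal' UNION ALL
  SELECT DATE '2017-10-09', 'business' UNION ALL
  SELECT DATE '2017-10-10', 'personal'
)
SELECT
  DATE_TRUNC(created_at, ISOWEEK) AS created_week,  -- Monday of the ISO week
  domain_type,
  COUNT(*) AS signups_in_week                       -- rows per week and type
FROM signups
GROUP BY created_week, domain_type
ORDER BY created_week, domain_type;
```

On these five rows it should return 1 business and 2 personal for the week of 2017-10-02, and 1 of each for the week of 2017-10-09.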

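I don't see anything "rolling" about what you're trying to do, unless, perhaps, you want counts over two consecutive weeks. If that is the case, use window functions: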

select date_trunc(date, week) as wk, email_type,
       count(*) as this_week,
       sum(count(*)) over (partition by email_type order by date_trunc(date, week) rows between 1 preceding and current row) as two_week_count
from t
group by wk, email_type
order by 1, 2;

WITH
  weekly AS
(
  SELECT
    DATE_TRUNC(CAST(created_at AS date), ISOWEEK)   AS created_week,
    *
  FROM
    yourData
)
SELECT
  created_week,
  domain_type,
  SUM(COUNT(*)) OVER (PARTITION BY domain_type ORDER BY created_week) AS cumulative_emails
FROM
  weekly
GROUP BY
  created_week,
  domain_type
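
The same pattern should carry over to the monthly counts the question asks for by truncating to MONTH instead of ISOWEEK. The sketch below is an illustration under assumptions: the inline `yourData` rows are invented, `created_at` is assumed to be a TIMESTAMP, and the extra `emails_in_month` column is only there to make the running total easy to check.

```sql
#standardSQL
WITH
  yourData AS
(
  -- Hypothetical sample rows; replace with the real table.
  SELECT TIMESTAMP '2017-09-28 10:00:00' AS created_at, 'personal' AS domain_type UNION ALL
  SELECT TIMESTAMP '2017-09-29 11:00:00', 'business' UNION ALL
  SELECT TIMESTAMP '2017-10-02 09:30:00', 'personal' UNION ALL
  SELECT TIMESTAMP '2017-10-03 14:00:00', 'personal' UNION ALL
  SELECT TIMESTAMP '2017-10-05 16:45:00', 'business'
),
  monthly AS
(
  SELECT
    DATE_TRUNC(CAST(created_at AS date), MONTH)   AS created_month,  -- first day of the month
    domain_type
  FROM
    yourData
)
SELECT
  created_month,
  domain_type,
  COUNT(*)                                                              AS emails_in_month,
  -- With ORDER BY and no explicit frame, the window runs from the first month up to the current one.
  SUM(COUNT(*)) OVER (PARTITION BY domain_type ORDER BY created_month) AS cumulative_emails
FROM
  monthly
GROUP BY
  created_month,
  domain_type
ORDER BY
  created_month,
  domain_type
```

For a bounded rolling window rather than a running total, an explicit frame such as `ROWS BETWEEN 1 PRECEDING AND CURRENT ROW` can be added after the `ORDER BY` inside `OVER (...)`. Be aware that a row-based frame counts result rows, so a month with no data is skipped rather than treated as zero.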

Below is for BigQuery Standard SQL

#standardSQL
SELECT Date, domain_type, 
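  MAX(IF(domain_type = 'personal', personal, business)) AS rolling_count_for_week  -- all rows of a Date carry the same running total; SUM would multiply it on days with several rows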
FROM (
  SELECT Date, type AS domain_type, 
    SUM(IF(domain_type = 'personal' AND domain_type = type, 1, 0)) OVER(ORDER BY Date) personal, 
    SUM(IF(domain_type = 'business' AND domain_type = type, 1, 0)) OVER(ORDER BY Date) business
  FROM `project.dataset.table`,
  UNNEST(['personal', 'business']) type
)
WHERE EXTRACT(DAYOFWEEK FROM Date) = 2
GROUP BY Date, domain_type

If applied to the sample data in your question, the output is

Row  Date        domain_type  rolling_count_for_week
1    2017-10-02  personal     1
2    2017-10-02  business     0
3    2017-10-09  personal     4
4    2017-10-09  business     4

What if, for one particular week, there is no data on dow=2 but there is data for the other days?

Good point, I assumed there is at least one entry for each day :o)

See the version below, which does not have this dependency

#standardSQL
WITH calendar_type AS (
  SELECT Date, type
  FROM (
    SELECT MIN(Date) min_date, MAX(Date) max_date
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(min_date, max_date)) Date,
  UNNEST(['personal', 'business']) type
)
SELECT Date, domain_type, 
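  MAX(IF(domain_type = 'personal', personal, business)) AS rolling_count_for_week  -- all rows of a Date carry the same running total; SUM would multiply it on days with several rows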
FROM (
  SELECT c.Date, type AS domain_type, 
    SUM(IF(domain_type = 'personal' AND domain_type = type, 1, 0)) OVER(ORDER BY c.Date) personal, 
    SUM(IF(domain_type = 'business' AND domain_type = type, 1, 0)) OVER(ORDER BY c.Date) business
  FROM calendar_type c
  LEFT JOIN `project.dataset.table` t
  ON c.Date = t.Date AND c.type = t.domain_type
)
WHERE EXTRACT(DAYOFWEEK FROM Date) = 2
GROUP BY Date, domain_type
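
If one row per week with a column per domain type is acceptable, the is_personal / is_business idea from the question can be expressed with COUNTIF, without storing any extra columns. This is a sketch under assumptions: the inline `t` rows are invented, `created_at` is assumed to be a DATE, and the output column names are hypothetical.

```sql
#standardSQL
WITH t AS (
  -- Hypothetical sample rows; replace with the real table.
  SELECT DATE '2017-10-02' AS created_at, 'personal' AS domain_type UNION ALL
  SELECT DATE '2017-10-03', 'business' UNION ALL
  SELECT DATE '2017-10-04', 'personal' UNION ALL
  SELECT DATE '2017-10-09', 'business' UNION ALL
  SELECT DATE '2017-10-10', 'business'
),
weekly AS (
  -- One row per ISO week, counting each domain type straight from the string column.
  SELECT
    DATE_TRUNC(created_at, ISOWEEK) AS week_start,
    COUNTIF(domain_type = 'personal') AS personal,
    COUNTIF(domain_type = 'business') AS business
  FROM t
  GROUP BY week_start
)
SELECT
  week_start,
  SUM(personal) OVER (ORDER BY week_start) AS personal_running,  -- running total to date
  SUM(business) OVER (ORDER BY week_start) AS business_running
FROM weekly
ORDER BY week_start;
```

Unlike the calendar version above, a week with no rows at all simply does not appear in this result, so the calendar join is still the safer choice when every week must be listed.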