Is there a way to group timestamp data by 30 day intervals starting from the min(date) and add them as columns

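I am trying to use the min() value of a timestamp as the starting point and then group the data by 30-day intervals, so that for each unique value I get the number of occurrences within each date range as a column.

I have two tables that I join in order to do the counting. Table 1 (page_creation) has 2 columns labeled link and dt_crtd. Table 2 (page_visits) has 2 other columns labeled url and date. The tables are joined on table1.link = table2.url.

After the join I get a table similar to: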

+-------------------+----------------------+
| url               | date                 |
+-------------------+----------------------+
| www.google.com    | 2018-01-01 00:00:00  |
| www.google.com    | 2018-01-02 00:00:00  |
| www.google.com    | 2018-02-01 00:00:00  |
| www.google.com    | 2018-02-05 00:00:00  |
| www.google.com    | 2018-03-04 00:00:00  |
| www.facebook.com  | 2014-01-05 00:00:00  |
| www.facebook.com  | 2014-01-07 00:00:00  |
| www.facebook.com  | 2014-04-02 00:00:00  |
| www.facebook.com  | 2014-04-10 00:00:00  |
| www.facebook.com  | 2014-04-11 00:00:00  |
| www.facebook.com  | 2014-05-01 00:00:00  |
| www.twitter.com   | 2016-02-01 00:00:00  |
| www.twitter.com   | 2016-03-04 00:00:00  |
+-------------------+----------------------+

The result I am trying to get is:

+-------------------+----------------------+------------+------------+------------+
| url               | MIN_Date             | Interval 1 | Interval 2 | Interval 3 |
+-------------------+----------------------+------------+------------+------------+
| www.google.com    | 2018-01-01 00:00:00  | 2          | 2          | 1          |
| www.facebook.com  | 2014-01-05 00:00:00  | 2          | 0          | 1          |
| www.twitter.com   | 2016-02-01 00:00:00  | 1          | 1          | 0          |
+-------------------+----------------------+------------+------------+------------+

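So the 30-day intervals start at min(date), shown as Interval 1, and a new interval begins every 30 days after that.

I have looked at other questions, such as:

Group rows by 7 days interval starting from a certain date

However, it does not seem to answer my specific question.

I have also looked into PIVOT syntax, but noticed that it is only supported by certain DBMSs.

Any help would be appreciated.

Thank you.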

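If I understand your question correctly, you want to count page visits in the 30-day, 60-day, and 90-day intervals after the page was created. If that is the requirement, try the SQL below:

-- page_creation has link (not url), so select and group by a11.link
-- "dt_crtd + 30" day arithmetic is DBMS-specific; adapt it to your engine
select a11.link as url
,sum(case when a12.date between a11.dt_crtd and a11.dt_crtd+30 then 1 else 0 end) Interval_1
,sum(case when a12.date between a11.dt_crtd+31 and a11.dt_crtd+60 then 1 else 0 end) Interval_2
,sum(case when a12.date between a11.dt_crtd+61 and a11.dt_crtd+90 then 1 else 0 end) Interval_3
from page_creation a11
join page_visits a12
on a11.link = a12.url
group by a11.link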
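
The conditional-aggregation pattern above does not need PIVOT, so it can be tried on almost any engine. Below is a minimal sketch that runs it in an in-memory SQLite database through Python's standard sqlite3 module. The choice of SQLite, the julianday() day difference, and the half-open bounds (0 to 29, 30 to 59, 60 to 89 days) are my assumptions for illustration, not part of the answer; the sample rows are a subset of the data in the question, with dt_crtd assumed equal to the earliest date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_creation (link TEXT, dt_crtd TEXT);
    CREATE TABLE page_visits (url TEXT, date TEXT);
""")
conn.executemany("INSERT INTO page_creation VALUES (?, ?)", [
    ("www.google.com", "2018-01-01 00:00:00"),
    ("www.twitter.com", "2016-02-01 00:00:00"),
])
conn.executemany("INSERT INTO page_visits VALUES (?, ?)", [
    ("www.google.com", "2018-01-02 00:00:00"),
    ("www.google.com", "2018-02-01 00:00:00"),
    ("www.google.com", "2018-02-05 00:00:00"),
    ("www.google.com", "2018-03-04 00:00:00"),
    ("www.twitter.com", "2016-03-04 00:00:00"),
])

# julianday() returns a fractional day number, so the difference is the
# number of days between creation and visit (SQLite-specific assumption).
QUERY = """
SELECT link,
       SUM(CASE WHEN days >= 0  AND days < 30 THEN 1 ELSE 0 END) AS interval_1,
       SUM(CASE WHEN days >= 30 AND days < 60 THEN 1 ELSE 0 END) AS interval_2,
       SUM(CASE WHEN days >= 60 AND days < 90 THEN 1 ELSE 0 END) AS interval_3
FROM (
    SELECT c.link AS link,
           julianday(v.date) - julianday(c.dt_crtd) AS days
    FROM page_creation c
    JOIN page_visits v ON c.link = v.url
)
GROUP BY link
ORDER BY link
"""
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Note that with a plain join the creation row itself is not counted as a visit, so Interval 1 comes out one lower than in the desired output of the question.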

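If you are using BigQuery, I would recommend:

  • countif() to count a boolean condition
  • timestamp_add() to add an interval to a timestamp

The exact boundaries are a bit vague, but I would go for: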

-- half-open windows: start inclusive, end exclusive
select pc.link as url,
       countif(pv.date >= pc.dt_crtd and
               pv.date < timestamp_add(pc.dt_crtd, interval 30 day)
              ) as Interval_00_29,
       countif(pv.date >= timestamp_add(pc.dt_crtd, interval 30 day) and
               pv.date < timestamp_add(pc.dt_crtd, interval 60 day)
              ) as Interval_30_59,
       countif(pv.date >= timestamp_add(pc.dt_crtd, interval 60 day) and
               pv.date < timestamp_add(pc.dt_crtd, interval 90 day)
              ) as Interval_60_89
from page_creation pc join
     page_visits pv
     on pc.link = pv.url
group by pc.link
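
To see why half-open bounds (start inclusive, end exclusive) are a reasonable choice here, the following small Python sketch checks a few edge-case timestamps against the first three windows. The helper name in_interval and the edge-case values are hypothetical, chosen only for illustration; the point is that every timestamp, including ones with a time-of-day part, matches exactly one window.

```python
from datetime import datetime, timedelta


def in_interval(created, visited, k, width_days=30):
    """True if visited lies in the k-th half-open window
    [created + k*width, created + (k+1)*width)."""
    start = created + timedelta(days=k * width_days)
    end = created + timedelta(days=(k + 1) * width_days)
    return start <= visited < end


created = datetime(2018, 1, 1)
edge_cases = [
    datetime(2018, 1, 30, 23, 59, 59),  # last second before the 30-day mark
    datetime(2018, 1, 31, 0, 0, 0),     # exactly 30 days after creation
    datetime(2018, 1, 31, 12, 0, 0),    # 30.5 days after creation
]

# For each timestamp, list the windows (0, 1, 2) that contain it.
matches = [[k for k in range(3) if in_interval(created, ts, k)]
           for ts in edge_cases]
print(matches)
```

With inclusive bounds on whole days (for example day 0 to day 29, then day 30 to day 59), a timestamp such as 29 days and 12 hours after creation would fall between two windows and not be counted at all.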

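The way I read your scenario, especially based on the example under "After the join i get a table similar to ...", you have two tables that you need to UNION, not JOIN.

So, based on that reading, below is an example for BigQuery Standard SQL (project.dataset.page_creation and project.dataset.page_visits are here just to mimic your Table 1 and Table 2):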

#standardSQL
WITH `project.dataset.page_creation` AS (
  SELECT 'www.google.com' link, TIMESTAMP '2018-01-01 00:00:00' dt_crtd UNION ALL
  SELECT 'www.facebook.com', '2014-01-05 00:00:00' UNION ALL
  SELECT 'www.twitter.com', '2016-02-01 00:00:00' 
), `project.dataset.page_visits` AS (
  SELECT 'www.google.com' url, TIMESTAMP '2018-01-02 00:00:00' dt UNION ALL
  SELECT 'www.google.com', '2018-02-01 00:00:00' UNION ALL
  SELECT 'www.google.com', '2018-02-05 00:00:00' UNION ALL
  SELECT 'www.google.com', '2018-03-04 00:00:00' UNION ALL
  SELECT 'www.facebook.com', '2014-01-07 00:00:00' UNION ALL
  SELECT 'www.facebook.com', '2014-04-02 00:00:00' UNION ALL
  SELECT 'www.facebook.com', '2014-04-10 00:00:00' UNION ALL
  SELECT 'www.facebook.com', '2014-04-11 00:00:00' UNION ALL
  SELECT 'www.facebook.com', '2014-05-01 00:00:00' UNION ALL
  SELECT 'www.twitter.com', '2016-03-04 00:00:00' 
), `After the join` AS (
  SELECT url, dt FROM `project.dataset.page_visits` UNION DISTINCT
  SELECT link, dt_crtd FROM `project.dataset.page_creation`
)
SELECT 
  url, min_date, 
  COUNTIF(dt BETWEEN min_date AND TIMESTAMP_ADD(min_date, INTERVAL 29 DAY)) Interval_1,
  COUNTIF(dt BETWEEN TIMESTAMP_ADD(min_date, INTERVAL 30 DAY) AND TIMESTAMP_ADD(min_date, INTERVAL 59 DAY)) Interval_2,
  COUNTIF(dt BETWEEN TIMESTAMP_ADD(min_date, INTERVAL 60 DAY) AND TIMESTAMP_ADD(min_date, INTERVAL 89 DAY)) Interval_3
FROM (
  SELECT url, dt, MIN(dt) OVER(PARTITION BY url ORDER BY dt) min_date
  FROM `After the join`
)
GROUP BY url, min_date

with the result:

Row url                 min_date                    Interval_1  Interval_2  Interval_3   
1   www.facebook.com    2014-01-05 00:00:00 UTC     2           0           1    
2   www.google.com      2018-01-01 00:00:00 UTC     2           2           1    
3   www.twitter.com     2016-02-01 00:00:00 UTC     1           1           0
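
As a cross-check of the numbers above outside BigQuery, here is a rough Python sketch of the same steps: union the two tables, take the minimum date per url, and count rows per 30-day bucket. It is only an illustration under my own assumptions (plain Python lists standing in for the tables, dates without a time part, floor division of the day offset by 30 as the bucket rule); it is not how BigQuery evaluates the query.

```python
from collections import defaultdict
from datetime import datetime, timedelta

page_creation = [
    ("www.google.com", "2018-01-01"),
    ("www.facebook.com", "2014-01-05"),
    ("www.twitter.com", "2016-02-01"),
]
page_visits = [
    ("www.google.com", "2018-01-02"),
    ("www.google.com", "2018-02-01"),
    ("www.google.com", "2018-02-05"),
    ("www.google.com", "2018-03-04"),
    ("www.facebook.com", "2014-01-07"),
    ("www.facebook.com", "2014-04-02"),
    ("www.facebook.com", "2014-04-10"),
    ("www.facebook.com", "2014-04-11"),
    ("www.facebook.com", "2014-05-01"),
    ("www.twitter.com", "2016-03-04"),
]

# Equivalent of UNION DISTINCT: merge both tables and drop duplicates.
events = set(page_creation) | set(page_visits)

by_url = defaultdict(list)
for url, day in events:
    by_url[url].append(datetime.strptime(day, "%Y-%m-%d"))

result = {}
for url, stamps in by_url.items():
    min_date = min(stamps)
    counts = [0, 0, 0]
    for ts in stamps:
        # 0-based bucket index: whole 30-day periods since min_date.
        idx = (ts - min_date) // timedelta(days=30)
        if idx < 3:  # rows beyond Interval_3 are ignored, as in the query
            counts[idx] += 1
    result[url] = (min_date.strftime("%Y-%m-%d"), *counts)

for url in sorted(result):
    print(url, result[url])
```

The facebook rows from 2014-04-10 onward are 95 or more days after the minimum date, so they fall outside the three intervals, which is why Interval_3 is 1 for that url.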