BigQuery: Computing an aggregate over a window of time for each person
Given a table in Google BigQuery:
User Timestamp
A TIMESTAMP(12/05/2015 12:05:01.8023)
B TIMESTAMP(9/29/2015 12:15:01.0323)
B TIMESTAMP(9/29/2015 13:05:01.0233)
A TIMESTAMP(9/29/2015 14:05:01.0432)
C TIMESTAMP(8/15/2015 5:05:01.0000)
B TIMESTAMP(9/29/2015 14:06:01.0233)
A TIMESTAMP(9/29/2015 14:06:01.0432)
Is there a simple way to compute:
User Maximum_Number_of_Events_this_User_Had_in_One_Hour
A 2
B 3
C 1
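where the one-hour time window is a parameter?
I tried to do this myself with a combination of LAG and partition functions, building on these two questions: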
BigQuery SQL for 28-day sliding window aggregate (without writing 28 lines of SQL)
Bigquery SQL for sliding window aggregate
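but found those posts too dissimilar, because I am not looking for the number of people per time window, but for the maximum number of events per person within a time window.
I think you can use a query like this (in T-SQL):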
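-- Count events per user per calendar hour, then keep each user's busiest hour.
-- This buckets by clock hour (14:00-14:59), not by a sliding one-hour window.
SELECT "User", MAX(s) AS Maximum_Number_of_Events_this_User_Had_in_One_Hour
FROM (
    SELECT "User", COUNT(*) AS s
    FROM yourTable
    GROUP BY "User", CAST("Timestamp" AS date), DATEPART(hour, "Timestamp")
) AS t
GROUP BY "User"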
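Grouping by calendar hour and counting within a sliding one-hour window are not the same thing, and the difference shows up when events straddle an hour boundary. The following is a minimal Python sketch of both counts, not part of the answer above; the function names and the two sample events are hypothetical and chosen only to illustrate the gap.

```python
from collections import Counter
from datetime import datetime, timedelta


def max_per_calendar_hour(events):
    """Bucket each timestamp by its clock hour and return the largest bucket."""
    buckets = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in events
    )
    return max(buckets.values(), default=0)


def max_per_sliding_window(events, window=timedelta(hours=1)):
    """Largest number of events in any interval of length `window`
    that ends at one of the events (two-pointer scan over sorted input)."""
    ordered = sorted(events)
    best = 0
    start = 0
    for end, ts in enumerate(ordered):
        # Drop events that are more than `window` older than the current one.
        while ts - ordered[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


# Hypothetical events two minutes apart, on either side of 14:00.
events = [datetime(2015, 9, 29, 13, 59), datetime(2015, 9, 29, 14, 1)]
by_bucket = max_per_calendar_hour(events)
by_window = max_per_sliding_window(events)
```

Here the calendar-hour count sees the two events in different buckets, while the sliding count treats them as falling within the same hour.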
Try the query below for GBQ. I haven't tested it much, but it looks workable to me.
-- For each event b, count the same user's events w that fall within
-- 3600 seconds after b, then take the maximum count per user.
SELECT
  User, MAX(Events) AS Max_Events
FROM (
  SELECT
    b.User AS User,
    b.Timestamp AS Timestamp,
    COUNT(1) AS Events
  FROM [your_dataset.your_table] AS b
  JOIN (
    SELECT User, Timestamp
    FROM [your_dataset.your_table]
  ) AS w
  ON w.User = b.User
  WHERE (TIMESTAMP_TO_SEC(TIMESTAMP(w.Timestamp)) -
         TIMESTAMP_TO_SEC(TIMESTAMP(b.Timestamp))) BETWEEN 0 AND 3600
  GROUP BY 1, 2
)
GROUP BY 1
Here is an efficient and concise approach that exploits the ordered structure of the timestamps.
SELECT
  user,
  MAX(per_hour) AS max_event_per_hour
FROM (
  SELECT
    user,
    -- Window size in microseconds: 60 * 60 * 1000000 = one hour.
    COUNT(*) OVER (
      PARTITION BY user
      ORDER BY timestamp
      RANGE BETWEEN 60 * 60 * 1000000 PRECEDING AND CURRENT ROW
    ) AS per_hour,
    timestamp
  FROM
    [dataset_example_in_question_user_timestamps]
)
GROUP BY user
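To make the frame semantics concrete, here is a rough Python sketch of what the windowed count computes: for every event, the number of the same user's events whose timestamp lies between one window length earlier and the event itself (inclusive), followed by a per-user maximum. It is an illustration under my own assumptions, not BigQuery code; the function name `max_events_per_window` is hypothetical, and the rows are the question's sample data written as Python datetimes.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from datetime import datetime, timedelta


def max_events_per_window(rows, window=timedelta(hours=1)):
    """For each user, return the largest number of events found in any
    interval [ts - window, ts] that ends at one of that user's events."""
    by_user = defaultdict(list)
    for user, ts in rows:
        by_user[user].append(ts)

    result = {}
    for user, stamps in by_user.items():
        stamps.sort()
        # bisect_left finds the first event at or after ts - window;
        # bisect_right counts ties with ts, like peers of CURRENT ROW.
        result[user] = max(
            bisect_right(stamps, ts) - bisect_left(stamps, ts - window)
            for ts in stamps
        )
    return result


# Sample rows from the question (fractional seconds as microseconds).
rows = [
    ("A", datetime(2015, 12, 5, 12, 5, 1, 802300)),
    ("B", datetime(2015, 9, 29, 12, 15, 1, 32300)),
    ("B", datetime(2015, 9, 29, 13, 5, 1, 23300)),
    ("A", datetime(2015, 9, 29, 14, 5, 1, 43200)),
    ("C", datetime(2015, 8, 15, 5, 5, 1)),
    ("B", datetime(2015, 9, 29, 14, 6, 1, 23300)),
    ("A", datetime(2015, 9, 29, 14, 6, 1, 43200)),
]
counts = max_events_per_window(rows)
```

With a strict one-hour window this sketch gives A 2, B 2, C 1 on the sample rows, because B's third event comes 61 minutes after the second. B reaches 3 only when the `window` argument is widened, for example to `timedelta(hours=2)`.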