How to group by time bucket in ClickHouse and fill missing data with nulls/0s
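Suppose I have a given time range. To keep the explanation simple, take the whole of 2018. I want to query ClickHouse for the sum of a metric aggregated per quarter, so the result should have 4 rows.
The problem is that I only have data for two quarters, so GROUP BY quarter returns only two rows.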
SELECT
    toStartOfQuarter(created_at) AS time,
    sum(metric) metric
FROM mytable
WHERE
    created_at >= toDate(1514761200) AND created_at >= toDateTime(1514761200)
    AND
    created_at <= toDate(1546210800) AND created_at <= toDateTime(1546210800)
GROUP BY time
ORDER BY time
1514761200 is 2018-01-01
1546210800 is 2018-12-31
This returns:
time metric
2018-01-01 345
2018-04-01 123
What I need:
time metric
2018-01-01 345
2018-04-01 123
2018-07-01 0
2018-10-01 0
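This is a simplified example. In the real use case the bucket would be e.g. 5 minutes rather than a quarter, and the GROUP BY would include at least one more attribute, as in GROUP BY attribute1, time
So the desired result is: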
time metric attribute1
2018-01-01 345 1
2018-01-01 345 2
2018-04-01 123 1
2018-04-01 123 2
2018-07-01 0 1
2018-07-01 0 2
2018-10-01 0 1
2018-10-01 0 2
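Is there a way to fill the whole given interval? Something like the fill argument of GROUP BY in InfluxDB, or TimescaleDB's time_bucket() function combined with generate_series().
I searched the ClickHouse documentation and GitHub issues and it seems this is not implemented yet, so the question is really whether there is a workaround.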
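You can generate the zero values with the numbers() function, combine them with your query using UNION ALL, and then GROUP BY over the combined data.
Your query would then look like this: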
SELECT SUM(metric),
       time
FROM (
    -- zero rows: 12 steps of 30 days touch every quarter of 2018
    -- (numbers(30) would run on into 2019 and 2020 and add extra zero rows)
    SELECT toStartOfQuarter(toDate(1514761200 + number*30*24*3600)) time,
           toUInt16(0) AS metric
    FROM numbers(12)
    UNION ALL
    -- the real data
    SELECT toStartOfQuarter(created_at) AS time,
           metric
    FROM mytable
    WHERE created_at >= toDate(1514761200)
      AND created_at >= toDateTime(1514761200)
      AND created_at <= toDate(1546210800)
      AND created_at <= toDateTime(1546210800)
)
GROUP BY time
ORDER BY time
Note the toUInt16(0): the zero value must have the same type as metric.
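The question mentions 5-minute buckets rather than quarters. The same numbers() plus UNION ALL idea could be sketched for that case as follows. This is only an illustration under assumptions: the one-hour window, the inline sample rows, and the column name value are made up so the query runs without any table, and the metric is assumed to be a non-negative integer.

```sql
-- Sketch: zero-filled 5-minute buckets for the window [10:00, 11:00).
SELECT
    time,
    sum(metric) AS total
FROM
(
    -- 12 zero rows, one per 5-minute bucket (300 seconds apart)
    SELECT
        toDateTime('2018-01-01 10:00:00') + number * 300 AS time,
        toUInt64(0) AS metric
    FROM numbers(12)
    UNION ALL
    -- the real data, truncated to the start of its 5-minute bucket
    SELECT
        toStartOfInterval(created_at, INTERVAL 5 MINUTE) AS time,
        toUInt64(value) AS metric
    FROM
    (
        -- hypothetical sample rows standing in for the real table
        SELECT toDateTime('2018-01-01 10:02:00') AS created_at, 5 AS value
        UNION ALL
        SELECT toDateTime('2018-01-01 10:41:00') AS created_at, 7 AS value
    )
)
GROUP BY time
ORDER BY time
```

With these sample rows the query should return 12 rows, with non-zero totals only in the 10:00 and 10:40 buckets. For a real table, replace the innermost subquery with the table and a WHERE clause on the same window.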
In some cases range and the array functions can be a useful alternative to numbers().
Example: for every (id1, id2) pair, generate one row for each of the previous 7 days.
SELECT
    id1,
    id2,
    arrayJoin(
        arrayMap( x -> today() - 7 + x, range(7) )
    ) as date2
FROM table
-- today() - 7 subtracts days; now() - 7 would subtract only 7 seconds
WHERE date >= today() - 7
GROUP BY id1, id2
The result of that SELECT can then be used in a UNION ALL to fill the holes in the data.
SELECT id1, id2, date, sum(column1)
FROM (
    -- the real data
    SELECT
        id1,
        id2,
        date,
        column1
    FROM table
    WHERE date >= today() - 7
    UNION ALL
    -- one zero row per (id1, id2) pair and day
    SELECT
        id1,
        id2,
        arrayJoin(
            arrayMap( x -> today() - 7 + x, range(7) )
        ) as date2,
        0 as column1
    FROM table
    WHERE date >= today() - 7
    GROUP BY id1, id2
)
GROUP BY id1, id2, date
ORDER BY date, id1, id2
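Applied to the table from the question, the zero rows for every (attribute1, quarter) pair could also be built with a CROSS JOIN instead of arrayJoin. This is a rough sketch, assuming attribute1 is a column of mytable and metric is a UInt16; the table and column names come from the question, while the aliases a, b and total are introduced here for illustration only.

```sql
-- Sketch: zero-fill every (attribute1, quarter) combination for 2018.
SELECT
    attribute1,
    time,
    sum(metric) AS total
FROM
(
    -- zero rows: every known attribute1 value crossed with the 4 quarter starts
    SELECT
        attribute1,
        time,
        toUInt16(0) AS metric   -- assumed to match the type of mytable.metric
    FROM (SELECT DISTINCT attribute1 FROM mytable) AS a
    CROSS JOIN
    (
        SELECT addMonths(toDate('2018-01-01'), number * 3) AS time
        FROM numbers(4)
    ) AS b
    UNION ALL
    -- the real data
    SELECT
        attribute1,
        toStartOfQuarter(created_at) AS time,
        metric
    FROM mytable
    WHERE created_at >= toDateTime('2018-01-01 00:00:00')
      AND created_at <  toDateTime('2019-01-01 00:00:00')
)
GROUP BY attribute1, time
ORDER BY time, attribute1
```

Only attribute1 values that occur somewhere in mytable get zero rows here; if the list of attributes is known in advance, it could replace the DISTINCT subquery.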
Here is how I handled hourly buckets (needed for visualization in Grafana).
Thanks to @filimonov and @mikhail.
SELECT t, SUM(metric) as metric FROM (
SELECT
arrayJoin(
arrayMap( x -> toStartOfHour(addHours(toDateTime($from),x)),
range(toUInt64(
dateDiff('hour',
toDateTime($from),
toDateTime($to)) + 1)))
) as t,
0 as metric
UNION ALL
SELECT
toStartOfHour(my_date) as t,
COUNT(metric)
FROM my_table
WHERE t BETWEEN toDateTime($from) AND toDateTime($to)
GROUP BY t
)
GROUP BY t ORDER BY t
So, for example, for the range from 2019-01-01 to 2019-01-02 it gives you:
SELECT t, SUM(metric) as metric FROM (
SELECT
arrayJoin(
arrayMap( x -> toStartOfHour(addHours(toDateTime('2019-01-01 00:00:00'),x)),
range(toUInt64(
dateDiff('hour',
toDateTime('2019-01-01 00:00:00'),
toDateTime('2019-01-02 00:00:00')) + 1)))
) as t,
0 as metric
UNION ALL
SELECT
toStartOfHour(my_date) as t,
COUNT(1) as metric
FROM my_table
WHERE t BETWEEN toDateTime('2019-01-01 00:00:00') AND toDateTime('2019-01-02 00:00:00')
GROUP BY t
)
GROUP BY t ORDER BY t;
t |metric|
-------------------|------|
2019-01-01 00:00:00| 0|
2019-01-01 01:00:00| 0|
2019-01-01 02:00:00| 0|
2019-01-01 03:00:00| 0|
2019-01-01 04:00:00| 0|
2019-01-01 05:00:00| 0|
2019-01-01 06:00:00| 0|
2019-01-01 07:00:00|105702|
2019-01-01 08:00:00|113315|
2019-01-01 09:00:00|149837|
2019-01-01 10:00:00|185314|
2019-01-01 11:00:00|246106|
2019-01-01 12:00:00|323036|
2019-01-01 13:00:00| 0|
2019-01-01 14:00:00|409160|
2019-01-01 15:00:00|379113|
2019-01-01 16:00:00|256634|
2019-01-01 17:00:00|286601|
2019-01-01 18:00:00|280039|
2019-01-01 19:00:00|248504|
2019-01-01 20:00:00|218642|
2019-01-01 21:00:00|186152|
2019-01-01 22:00:00|148478|
2019-01-01 23:00:00|109721|
2019-01-02 00:00:00| 0|
Starting with ClickHouse 19.14 you can use the WITH FILL clause. It can fill quarters like this:
WITH
(
SELECT toRelativeQuarterNum(toDate('1970-01-01'))
) AS init
SELECT
-- build the date from the relative quarter number
toDate('1970-01-01') + toIntervalQuarter(q - init) AS time,
metric
FROM
(
SELECT
toRelativeQuarterNum(created_at) AS q,
sum(rand()) AS metric
FROM
(
-- generate some dates and metrics values with gaps
SELECT toDate(arrayJoin(range(1514761200, 1546210800, ((60 * 60) * 24) * 180))) AS created_at
)
GROUP BY q
ORDER BY q ASC WITH FILL FROM toRelativeQuarterNum(toDate(1514761200)) TO toRelativeQuarterNum(toDate(1546210800)) STEP 1
)
┌───────time─┬─────metric─┐
│ 2018-01-01 │ 2950782089 │
│ 2018-04-01 │ 2972073797 │
│ 2018-07-01 │ 0 │
│ 2018-10-01 │ 179581958 │
└────────────┴────────────┘
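For fixed-width buckets such as hours, WITH FILL can be applied to the DateTime column directly, without going through relative numbers. A minimal sketch, reusing the my_table and my_date names from the hourly answer above; for a DateTime sort key a numeric STEP is interpreted as seconds, and the TO bound is exclusive.

```sql
-- Sketch: hourly counts for 2019-01-01, with missing hours filled by WITH FILL.
SELECT
    toStartOfHour(my_date) AS t,
    count() AS metric
FROM my_table
WHERE my_date >= toDateTime('2019-01-01 00:00:00')
  AND my_date <  toDateTime('2019-01-02 00:00:00')
GROUP BY t
ORDER BY t ASC WITH FILL
    FROM toDateTime('2019-01-01 00:00:00')
    TO   toDateTime('2019-01-02 00:00:00')
    STEP 3600   -- one hour, in seconds
```

Filled rows get the default value of the other columns, so metric is 0 for hours with no data. Because TO is exclusive, this should produce the 24 buckets from 00:00 through 23:00.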