Vertica - Is there LATERAL VIEW functionality?
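I need to pivot a matrix for TIMESERIES interpolation/gap filling, and would like to avoid the messy and inefficient UNION ALL approach. Is there anything in Vertica like Hive's LATERAL VIEW EXPLODE?
Edit:
@marcothesane - thanks for the interesting scenario. I like your approach to interpolation. I'll play around with it more and see how it goes. Looks promising.
FYI, here is the solution I came up with. My scenario is that I am trying to view memory usage by query (and by user, resource pool, etc.), basically trying to get a cost metric. I need to interpolate so that I can see total usage at any point in time. So here is my query, which slices the time series by second and then aggregates to give a "Megabyte_Seconds" metric by minute.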
with qry_cte as
(
select
session_id
, request_id
, date_trunc('second',start_timestamp) as dat_str
, timestampadd('ss'
, ceiling(request_duration_ms/1000)::int
, date_trunc('second',start_timestamp)
) as dat_end
, ceiling(request_duration_ms/1000)::int as secs
, memory_acquired_mb
from query_requests
where request_type = 'QUERY'
and request_duration_ms > 0
and memory_acquired_mb > 0
)
select date_trunc('minute',slice_time) as dat_minute
, count(distinct session_id || request_id::varchar) as queries
, sum(memory_acquired_mb) as mb_seconds
from (
select session_id, request_id, slice_time, ts_first_value(memory_acquired_mb) as memory_acquired_mb
from (
select session_id, request_id, dat_str as dat, memory_acquired_mb from qry_cte
union all
select session_id, request_id, dat_end as dat, memory_acquired_mb from qry_cte
) x
timeseries slice_time as '1 second' over (partition by session_id, request_id order by dat)
) x
group by 1 order by 1 desc
;
I actually have a scenario at hand that might match what you need:
Out of this:
id|day_strt |sales_01 |sales_02 |sales_03 |sales_04 |sales_05 |sales_06
1|2016-01-19 08:00:00| 1,842.25| 5,449.40|- |39,776.86|- | 9,424.10
2|2016-01-19 08:00:00|73,810.66|- | 9,867.70|- |76,723.91|95,605.14
make this:
id|day_strt |sales_01 |sales_02 |sales_03 |sales_04 |sales_05 |sales_06
1|2016-01-19 08:00:00| 1,842.25| 5,449.40|22,613.13|39,776.86|24,600.48| 9,424.10
2|2016-01-19 08:00:00|73,810.66|41,839.18| 9,867.70|43,295.81|76,723.91|95,605.14
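01 to 06 refer to the n-th hour of the day for which sales were recorded, starting at 08:00.
Below is the whole scenario, including the initial input data.
- The input data, as SELECT .. UNION ALL SELECT ...
- A table of 6 integers to cross join with the input table.
- Vertical pivot: cross join the input with the 6 integers and, depending on the index, output only the n-th sales column in a CASE expression. Finally, filter out any rows where that same CASE expression evaluates to NULL.
- Fill the gaps using the TIMESERIES clause and linear interpolation, for both the sales figures and the index column.
- Pivot everything horizontally again in the final query.
More performant than a UNION ALL over all the columns of the table, I can assure you.
Here goes: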
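The filled-in values in the second table are consistent with plain linear interpolation between the neighbouring hours. Since every gap in this sample is exactly one hour wide, that reduces to the midpoint of the two neighbours. A quick check, as a sketch that uses only the figures from the tables above (the column aliases are just illustrative labels):

```sql
-- Midpoint of the two neighbouring hourly values for each gap.
-- Expected values are the mathematical results; trailing zeros omitted.
SELECT
  (5449.40  + 39776.86) / 2 AS id1_sales_03   -- 22613.13
, (39776.86 + 9424.10)  / 2 AS id1_sales_05   -- 24600.48
, (73810.66 + 9867.70)  / 2 AS id2_sales_02   -- 41839.18
, (9867.70  + 76723.91) / 2 AS id2_sales_04   -- 43295.805, shown as 43,295.81 once rounded to 2 decimals
;
```

A gap wider than one hour would instead be filled proportionally to the elapsed time between the two known points, not with a single midpoint.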
WITH
-- input
input(id,day_strt,sales_01,sales_02,sales_03,sales_04,sales_05,sales_06) AS (
SELECT 1,'2016-01-19 08:00:00'::TIMESTAMP(0), 1842.25, 5449.40 ,NULL::INT,39776.86 ,NULL::INT, 9424.10
UNION ALL SELECT 2,'2016-01-19 08:00:00'::TIMESTAMP(0),73810.66 ,NULL::INT, 9867.70 ,NULL::INT,76723.91 ,95605.14
)
-- debug
-- SELECT * FROM input;
,
-- 6 hours to pivot vertically -> 6 integers
six_idxs(idx) AS (
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
UNION ALL SELECT 5
UNION ALL SELECT 6
)
,
-- pivot input vertically and remove rows with null measures
-- (could probably add the TIMESERIES clause here directly,
-- but less readable and maintainable)
vert_pivot AS (
SELECT
id
, idx
, TIMESTAMPADD(HOUR,idx-1,day_strt)::TIMESTAMP(0) AS sales_ts
, CASE idx
WHEN 1 THEN sales_01
WHEN 2 THEN sales_02
WHEN 3 THEN sales_03
WHEN 4 THEN sales_04
WHEN 5 THEN sales_05
WHEN 6 THEN sales_06
END AS sales
FROM input
CROSS JOIN six_idxs
WHERE (
CASE idx
WHEN 1 THEN sales_01
WHEN 2 THEN sales_02
WHEN 3 THEN sales_03
WHEN 4 THEN sales_04
WHEN 5 THEN sales_05
WHEN 6 THEN sales_06
END
) IS NOT NULL
)
-- debug:
-- SELECT * FROM vert_pivot;
,
-- gap filling and interpolation
gaps_filled AS (
SELECT
id
, TS_FIRST_VALUE(idx,'LINEAR') AS idx
, tm_sales_ts::TIMESTAMP(0) AS sales_ts
, TS_FIRST_VALUE(sales,'LINEAR') AS sales
FROM vert_pivot
TIMESERIES tm_sales_ts AS '1 HOUR' OVER(
PARTITION BY id ORDER BY sales_ts
)
)
-- debug
-- SELECT * FROM gaps_filled ORDER BY 1,2;
-- pivot horizontally; final query
SELECT
id
, MIN(sales_ts) AS day_strt
, SUM(CASE idx WHEN 1 THEN sales END)::NUMERIC(7,2) AS sales_01
, SUM(CASE idx WHEN 2 THEN sales END)::NUMERIC(7,2) AS sales_02
, SUM(CASE idx WHEN 3 THEN sales END)::NUMERIC(7,2) AS sales_03
, SUM(CASE idx WHEN 4 THEN sales END)::NUMERIC(7,2) AS sales_04
, SUM(CASE idx WHEN 5 THEN sales END)::NUMERIC(7,2) AS sales_05
, SUM(CASE idx WHEN 6 THEN sales END)::NUMERIC(7,2) AS sales_06
FROM gaps_filled
GROUP BY id
ORDER BY id
;
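To see the gap-filling step in isolation, here is a minimal sketch with only two known points two hours apart. The CTE name `readings` and its columns `ts` and `val` are made up for this illustration; the TIMESERIES clause and TS_FIRST_VALUE with 'LINEAR' are used the same way as in the gaps_filled step above. The explicit cast to FLOAT is an assumption to stay on the safe side with the interpolation function's input type.

```sql
WITH
-- two known readings, with the 09:00 slice missing
readings(ts, val) AS (
          SELECT '2016-01-19 08:00:00'::TIMESTAMP(0), 10.0::FLOAT
UNION ALL SELECT '2016-01-19 10:00:00'::TIMESTAMP(0), 30.0::FLOAT
)
SELECT
  slice_time
, TS_FIRST_VALUE(val,'LINEAR') AS val  -- interpolated linearly at each slice start
FROM readings
TIMESERIES slice_time AS '1 HOUR' OVER(ORDER BY ts)
ORDER BY slice_time
;
```

With linear interpolation, the generated 09:00 slice should come out halfway between the two readings, while the 08:00 and 10:00 slices keep their input values. Swapping 'LINEAR' for 'CONST' would instead carry the 08:00 value forward into the gap.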
Have fun -
Marco the Sane