是否可以在完全具体化视图之前回答对视图的查询?
Is it possible to answer queries on a view before fully materializing the view?
简而言之:Left Join 左侧的 Distinct,Min,Max,无需进行连接即可回答。
我正在使用 SQL 数组类型(在 Postgres 9.3 上)将多行数据压缩到一行中,然后查看 return 未嵌套的规范化视图。我这样做是为了节省索引成本,以及让 Postgres 压缩数组中的数据。
事情运行得很好,但是一些无需取消嵌套就可以回答的查询和 materializing/exploding 视图非常昂贵,因为它们被推迟到视图具体化之后。有什么办法可以解决这个问题吗?
这是基本的table:
CREATE TABLE mt_count_by_day
(
run_id integer NOT NULL,
type character varying(64) NOT NULL,
start_day date NOT NULL,
end_day date NOT NULL,
counts bigint[] NOT NULL,
CONSTRAINT mt_count_by_day_pkey PRIMARY KEY (run_id, type),
)
“类型”索引只是为了很好的衡量:
CREATE INDEX runinfo_mt_count_by_day_type_idx on runinfo.mt_count_by_day (type);
这里是使用generate_series和unnest
的视图
CREATE OR REPLACE VIEW runinfo.v_mt_count_by_day AS
SELECT mt_count_by_day.run_id,
mt_count_by_day.type,
mt_count_by_day.brand,
generate_series(mt_count_by_day.start_day::timestamp without time zone, mt_count_by_day.end_day - '1 day'::interval, '1 day'::interval) AS row_date,
unnest(mt_count_by_day.counts) AS row_count
FROM runinfo.mt_count_by_day;
如果我想对“类型”列进行区分怎么办?
explain analyze select distinct(type) from mt_count_by_day;
"HashAggregate (cost=9566.81..9577.28 rows=1047 width=19) (actual time=171.653..172.019 rows=1221 loops=1)"
" -> Seq Scan on mt_count_by_day (cost=0.00..9318.25 rows=99425 width=19) (actual time=0.089..99.110 rows=99425 loops=1)"
"Total runtime: 172.338 ms"
现在如果我在视图上做同样的事情会发生什么?
explain analyze select distinct(type) from v_mt_count_by_day;
"HashAggregate (cost=1749752.88..1749763.34 rows=1047 width=19) (actual time=58586.934..58587.191 rows=1221 loops=1)"
" -> Subquery Scan on v_mt_count_by_day (cost=0.00..1501190.38 rows=99425000 width=19) (actual time=0.114..37134.349 rows=68299959 loops=1)"
" -> Seq Scan on mt_count_by_day (cost=0.00..506940.38 rows=99425000 width=597) (actual time=0.113..24907.147 rows=68299959 loops=1)"
"Total runtime: 58587.474 ms"
有没有办法让 postgres 认识到它可以解决这个问题而无需先爆炸视图?
在这里我们可以看到为了比较我们正在计算 table 与视图中匹配条件的行数。一切都按预期工作。 Postgres 在具体化视图之前过滤行。不完全一样,但是 属性 让我们的数据更易于管理。
explain analyze select count(*) from mt_count_by_day where type = ’SOCIAL_GOOGLE'
"Aggregate (cost=157.01..157.02 rows=1 width=0) (actual time=0.538..0.538 rows=1 loops=1)"
" -> Bitmap Heap Scan on mt_count_by_day (cost=4.73..156.91 rows=40 width=0) (actual time=0.139..0.509 rows=122 loops=1)"
" Recheck Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
" -> Bitmap Index Scan on runinfo_mt_count_by_day_type_idx (cost=0.00..4.72 rows=40 width=0) (actual time=0.098..0.098 rows=122 loops=1)"
" Index Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
"Total runtime: 0.625 ms"
explain analyze select count(*) from v_mt_count_by_day where type = 'SOCIAL_GOOGLE'
"Aggregate (cost=857.11..857.12 rows=1 width=0) (actual time=6.827..6.827 rows=1 loops=1)"
" -> Bitmap Heap Scan on mt_count_by_day (cost=4.73..357.11 rows=40000 width=597) (actual time=0.124..5.294 rows=15916 loops=1)"
" Recheck Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
" -> Bitmap Index Scan on runinfo_mt_count_by_day_type_idx (cost=0.00..4.72 rows=40 width=0) (actual time=0.082..0.082 rows=122 loops=1)"
" Index Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
"Total runtime: 6.885 ms"
这是重现此代码所需的代码:
CREATE TABLE base_table
(
run_id integer NOT NULL,
type integer NOT NULL,
start_day date NOT NULL,
end_day date NOT NULL,
counts bigint[] NOT NULL
CONSTRAINT match_check CHECK (end_day > start_day AND (end_day - start_day) = array_length(counts, 1)),
CONSTRAINT base_table_pkey PRIMARY KEY (run_id, type)
);
--Just because...
CREATE INDEX base_type_idx on base_table (type);
CREATE OR REPLACE VIEW v_foo AS
SELECT m.run_id,
m.type,
t.row_date::date,
t.row_count
FROM base_table m
LEFT JOIN LATERAL ROWS FROM (
unnest(m.counts),
generate_series(m.start_day, m.end_day-1, interval '1d')
) t(row_count, row_date) ON true;
insert into base_table
select a.run_id, a.type, '20120101'::date as start_day, '20120401'::date as end_day, b.counts from (SELECT N AS run_id, L as type
FROM
generate_series(1, 10000) N
CROSS JOIN
generate_series(1, 7) L
ORDER BY N, L) a, (SELECT array_agg(generate_series)::bigint[] as counts FROM generate_series(1, 91) ) b
9.4.1 的结果:
解释分析 select 不同于 base_table 的类型;
"HashAggregate (cost=6750.00..6750.03 rows=3 width=4) (actual time=51.939..51.940 rows=3 loops=1)"
" Group Key: type"
" -> Seq Scan on base_table (cost=0.00..6600.00 rows=60000 width=4) (actual time=0.030..33.655 rows=60000 loops=1)"
"Planning time: 0.086 ms"
"Execution time: 51.975 ms"
解释分析 select 来自 v_foo 的不同类型;
"HashAggregate (cost=1356600.01..1356600.04 rows=3 width=4) (actual time=9215.630..9215.630 rows=3 loops=1)"
" Group Key: m.type"
" -> Nested Loop Left Join (cost=0.01..1206600.01 rows=60000000 width=4) (actual time=0.112..7834.094 rows=5460000 loops=1)"
" -> Seq Scan on base_table m (cost=0.00..6600.00 rows=60000 width=764) (actual time=0.009..42.694 rows=60000 loops=1)"
" -> Function Scan on t (cost=0.01..10.01 rows=1000 width=0) (actual time=0.091..0.111 rows=91 loops=60000)"
"Planning time: 0.132 ms"
"Execution time: 9215.686 ms"
通常,Postgres 查询规划器会“内联”视图来优化整个查询。 Per documentation:
One application of the rewrite system is in the realization of views.
Whenever a query against a view (i.e., a virtual table) is made, the
rewrite system rewrites the user's query to a query that accesses the
base tables given in the view definition instead.
但我不认为 Postgres 足够聪明得出结论,它可以从基 table 得到相同的结果而不会爆炸行。
您可以使用 LATERAL
联接尝试此替代查询。它更干净:
CREATE OR REPLACE VIEW runinfo.v_mt_count_by_day AS
SELECT m.run_id, m.type, m.brand
, m.start_day + c.rn - 1 AS row_date
, c.row_count
FROM runinfo.mt_count_by_day m
LEFT JOIN LATERAL unnest(m.counts) WITH ORDINALITY c(row_count, rn) ON true;
这也清楚地表明 (end_day
, start_day
) 之一是多余的。
使用 LEFT JOIN
因为这可能允许查询规划器忽略查询中的连接:
SELECT DISTINCT type FROM v_mt_count_by_day;
否则(使用 CROSS JOIN
或 INNER JOIN
)它 必须 评估连接以查看是否消除了第一个 table 中的行.
顺便说一句,它是:
SELECT DISTINCT type ...
不是:
<strike>SELECT DISTINCT(type)</strike> ...
请注意,此 return 是 date
而不是原始时间戳。更简单,我猜这就是你想要的?
需要 Postgres 9.3+ 详细信息:
- PostgreSQL unnest() with element number
ROWS FROM
在 Postgres 9.4+
并行分解两列安全:
CREATE OR REPLACE VIEW runinfo.v_mt_count_by_day AS
SELECT m.run_id, m.type, m.brand
t.row_date::date, t.row_count
FROM runinfo.mt_count_by_day m
LEFT JOIN LATERAL ROWS FROM (
unnest(m.counts)
, generate_series(m.start_day, m.end_day, interval '1d')
) t(row_count, row_date) ON true;
主要好处:如果两个 SRF 的行数 return 不同,这不会偏离笛卡尔积。相反,将填充 NULL 值。
同样,我不能说这是否会帮助查询计划者在不进行测试的情况下为 DISTINCT type
制定更快的计划。
简而言之:Left Join 左侧的 Distinct,Min,Max,无需进行连接即可回答。
我正在使用 SQL 数组类型(在 Postgres 9.3 上)将多行数据压缩到一行中,然后查看 return 未嵌套的规范化视图。我这样做是为了节省索引成本,以及让 Postgres 压缩数组中的数据。
事情运行得很好,但是一些无需取消嵌套就可以回答的查询和 materializing/exploding 视图非常昂贵,因为它们被推迟到视图具体化之后。有什么办法可以解决这个问题吗?
这是基本的table:
CREATE TABLE mt_count_by_day
(
run_id integer NOT NULL,
type character varying(64) NOT NULL,
start_day date NOT NULL,
end_day date NOT NULL,
counts bigint[] NOT NULL,
CONSTRAINT mt_count_by_day_pkey PRIMARY KEY (run_id, type),
)
“类型”索引只是为了很好的衡量:
CREATE INDEX runinfo_mt_count_by_day_type_idx on runinfo.mt_count_by_day (type);
这里是使用generate_series和unnest
的视图CREATE OR REPLACE VIEW runinfo.v_mt_count_by_day AS
SELECT mt_count_by_day.run_id,
mt_count_by_day.type,
mt_count_by_day.brand,
generate_series(mt_count_by_day.start_day::timestamp without time zone, mt_count_by_day.end_day - '1 day'::interval, '1 day'::interval) AS row_date,
unnest(mt_count_by_day.counts) AS row_count
FROM runinfo.mt_count_by_day;
如果我想对“类型”列进行区分怎么办?
explain analyze select distinct(type) from mt_count_by_day;
"HashAggregate (cost=9566.81..9577.28 rows=1047 width=19) (actual time=171.653..172.019 rows=1221 loops=1)"
" -> Seq Scan on mt_count_by_day (cost=0.00..9318.25 rows=99425 width=19) (actual time=0.089..99.110 rows=99425 loops=1)"
"Total runtime: 172.338 ms"
现在如果我在视图上做同样的事情会发生什么?
explain analyze select distinct(type) from v_mt_count_by_day;
"HashAggregate (cost=1749752.88..1749763.34 rows=1047 width=19) (actual time=58586.934..58587.191 rows=1221 loops=1)"
" -> Subquery Scan on v_mt_count_by_day (cost=0.00..1501190.38 rows=99425000 width=19) (actual time=0.114..37134.349 rows=68299959 loops=1)"
" -> Seq Scan on mt_count_by_day (cost=0.00..506940.38 rows=99425000 width=597) (actual time=0.113..24907.147 rows=68299959 loops=1)"
"Total runtime: 58587.474 ms"
有没有办法让 postgres 认识到它可以解决这个问题而无需先爆炸视图?
在这里我们可以看到为了比较我们正在计算 table 与视图中匹配条件的行数。一切都按预期工作。 Postgres 在具体化视图之前过滤行。不完全一样,但是 属性 让我们的数据更易于管理。
explain analyze select count(*) from mt_count_by_day where type = ’SOCIAL_GOOGLE'
"Aggregate (cost=157.01..157.02 rows=1 width=0) (actual time=0.538..0.538 rows=1 loops=1)"
" -> Bitmap Heap Scan on mt_count_by_day (cost=4.73..156.91 rows=40 width=0) (actual time=0.139..0.509 rows=122 loops=1)"
" Recheck Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
" -> Bitmap Index Scan on runinfo_mt_count_by_day_type_idx (cost=0.00..4.72 rows=40 width=0) (actual time=0.098..0.098 rows=122 loops=1)"
" Index Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
"Total runtime: 0.625 ms"
explain analyze select count(*) from v_mt_count_by_day where type = 'SOCIAL_GOOGLE'
"Aggregate (cost=857.11..857.12 rows=1 width=0) (actual time=6.827..6.827 rows=1 loops=1)"
" -> Bitmap Heap Scan on mt_count_by_day (cost=4.73..357.11 rows=40000 width=597) (actual time=0.124..5.294 rows=15916 loops=1)"
" Recheck Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
" -> Bitmap Index Scan on runinfo_mt_count_by_day_type_idx (cost=0.00..4.72 rows=40 width=0) (actual time=0.082..0.082 rows=122 loops=1)"
" Index Cond: ((type)::text = 'SOCIAL_GOOGLE'::text)"
"Total runtime: 6.885 ms"
这是重现此代码所需的代码:
CREATE TABLE base_table
(
run_id integer NOT NULL,
type integer NOT NULL,
start_day date NOT NULL,
end_day date NOT NULL,
counts bigint[] NOT NULL
CONSTRAINT match_check CHECK (end_day > start_day AND (end_day - start_day) = array_length(counts, 1)),
CONSTRAINT base_table_pkey PRIMARY KEY (run_id, type)
);
--Just because...
CREATE INDEX base_type_idx on base_table (type);
CREATE OR REPLACE VIEW v_foo AS
SELECT m.run_id,
m.type,
t.row_date::date,
t.row_count
FROM base_table m
LEFT JOIN LATERAL ROWS FROM (
unnest(m.counts),
generate_series(m.start_day, m.end_day-1, interval '1d')
) t(row_count, row_date) ON true;
insert into base_table
select a.run_id, a.type, '20120101'::date as start_day, '20120401'::date as end_day, b.counts from (SELECT N AS run_id, L as type
FROM
generate_series(1, 10000) N
CROSS JOIN
generate_series(1, 7) L
ORDER BY N, L) a, (SELECT array_agg(generate_series)::bigint[] as counts FROM generate_series(1, 91) ) b
9.4.1 的结果:
解释分析 select 不同于 base_table 的类型;
"HashAggregate (cost=6750.00..6750.03 rows=3 width=4) (actual time=51.939..51.940 rows=3 loops=1)"
" Group Key: type"
" -> Seq Scan on base_table (cost=0.00..6600.00 rows=60000 width=4) (actual time=0.030..33.655 rows=60000 loops=1)"
"Planning time: 0.086 ms"
"Execution time: 51.975 ms"
解释分析 select 来自 v_foo 的不同类型;
"HashAggregate (cost=1356600.01..1356600.04 rows=3 width=4) (actual time=9215.630..9215.630 rows=3 loops=1)"
" Group Key: m.type"
" -> Nested Loop Left Join (cost=0.01..1206600.01 rows=60000000 width=4) (actual time=0.112..7834.094 rows=5460000 loops=1)"
" -> Seq Scan on base_table m (cost=0.00..6600.00 rows=60000 width=764) (actual time=0.009..42.694 rows=60000 loops=1)"
" -> Function Scan on t (cost=0.01..10.01 rows=1000 width=0) (actual time=0.091..0.111 rows=91 loops=60000)"
"Planning time: 0.132 ms"
"Execution time: 9215.686 ms"
通常,Postgres 查询规划器会“内联”视图来优化整个查询。 Per documentation:
One application of the rewrite system is in the realization of views. Whenever a query against a view (i.e., a virtual table) is made, the rewrite system rewrites the user's query to a query that accesses the base tables given in the view definition instead.
但我不认为 Postgres 足够聪明得出结论,它可以从基 table 得到相同的结果而不会爆炸行。
您可以使用 LATERAL
联接尝试此替代查询。它更干净:
CREATE OR REPLACE VIEW runinfo.v_mt_count_by_day AS
SELECT m.run_id, m.type, m.brand
, m.start_day + c.rn - 1 AS row_date
, c.row_count
FROM runinfo.mt_count_by_day m
LEFT JOIN LATERAL unnest(m.counts) WITH ORDINALITY c(row_count, rn) ON true;
这也清楚地表明 (end_day
, start_day
) 之一是多余的。
使用 LEFT JOIN
因为这可能允许查询规划器忽略查询中的连接:
SELECT DISTINCT type FROM v_mt_count_by_day;
否则(使用 CROSS JOIN
或 INNER JOIN
)它 必须 评估连接以查看是否消除了第一个 table 中的行.
顺便说一句,它是:
SELECT DISTINCT type ...
不是:
<strike>SELECT DISTINCT(type)</strike> ...
请注意,此 return 是 date
而不是原始时间戳。更简单,我猜这就是你想要的?
需要 Postgres 9.3+ 详细信息:
- PostgreSQL unnest() with element number
ROWS FROM
在 Postgres 9.4+
并行分解两列安全:
CREATE OR REPLACE VIEW runinfo.v_mt_count_by_day AS
SELECT m.run_id, m.type, m.brand
t.row_date::date, t.row_count
FROM runinfo.mt_count_by_day m
LEFT JOIN LATERAL ROWS FROM (
unnest(m.counts)
, generate_series(m.start_day, m.end_day, interval '1d')
) t(row_count, row_date) ON true;
主要好处:如果两个 SRF 的行数 return 不同,这不会偏离笛卡尔积。相反,将填充 NULL 值。
同样,我不能说这是否会帮助查询计划者在不进行测试的情况下为 DISTINCT type
制定更快的计划。