左连接导致查询解析时间大幅增加
left join causes huge increase in time for query resolution
以下查询帮助我计算平均时间间隔内分布的历史值的平均值。
EXPLAIN ANALYZE SELECT start_date as date, AVG(hcv1.value::float) as value
FROM generate_series(cast('2017-01-01' as abstime), cast('2017-12-01' as abstime), interval '86400 seconds') start_date
LEFT JOIN history_values hv
ON (
hv.variable_id = 3 AND
hv.created_at BETWEEN start_date AND start_date + interval '86400 seconds'
)
GROUP BY start_date
ORDER BY start_date
这里是查询的报告:https://explain.depesz.com/s/q29a
现在,如果我尝试添加指向另一个 variable_id 的额外列 value2,查询时间将从 2 秒变为 150 秒:
EXPLAIN ANALYZE SELECT start_date as date,
AVG(hv1.value::float) as value1,
AVG(hv2.value::float) as value2
FROM generate_series(cast('2017-01-01' as abstime), cast('2017-12-01' as abstime), interval '86400 seconds') start_date
LEFT JOIN history_values hv1
ON (
hv1.variable_id = 2 AND
hv.created_at BETWEEN start_date AND start_date + interval '86400 seconds'
)
LEFT JOIN history_values hv2
ON (
hv2.variable_id = 3 AND
hv.created_at BETWEEN start_date AND start_date + interval '86400 seconds'
)
GROUP BY start_date
ORDER BY start_date
报告如下:https://explain.depesz.com/s/V1sV
谁能告诉我为什么?我真的希望时间在 4 秒左右,而不是差不多 75 倍。
另请注意:
SELECT COUNT(*) FROM history_values WHERE variable_id = 2 -- ~25k records
SELECT COUNT(*) FROM history_values WHERE variable_id = 3 -- ~25k records
您不是在添加额外的列,而是在添加另一个连接条件。而且你不需要额外的连接..
试试,只过滤 avg()
EXPLAIN ANALYZE
SELECT start_date as date,
AVG(hv1.value::float) FILTER ( WHERE hv1.variable_id = 1 ) as value1,
AVG(hv2.value::float) FILTER ( WHERE hv1.variable_id = 2 ) as value2
FROM generate_series(
cast('2017-01-01' as abstime)
, cast('2017-12-01' as abstime),
, interval '86400 seconds'
) AS start_date
LEFT JOIN history_values hv1
ON (
hv1.created_at >= cast('2017-01-01' as abstime) AND
hv1.created_at <= cast('2017-12-01' as abstime) AND
hv1.created_at >= start_date AND
hv1.created_at < start_date + interval '86400 seconds'
)
GROUP BY start_date
ORDER BY start_date
附带说明一下,您不应该使用 abstime
。那应该仅供内部使用。相反,我会使用
EXPLAIN ANALYZE
SELECT start_date::date AS date,
AVG(hv1.value::float) FILTER ( WHERE hv1.variable_id = 1 ) as value1,
AVG(hv2.value::float) FILTER ( WHERE hv1.variable_id = 2 ) as value2
FROM generate_series(
timestamp with time zone '2017-01-01',
timestamp with time zone '2017-12-01',
interval '1 day'
) AS start_date
LEFT JOIN history_values hv1
ON (
hv1.created_at BETWEEN (
timestamp with time zone '2017-01-01'
AND timestamp with time zone '2017-12-01'
) AND
hv1.created_at >= start_date AND
hv1.created_at < start_date + interval '1 day' AND
hv1.variable_id IN (1,2)
)
GROUP BY start_date
ORDER BY start_date
我还认为你可以将这些范围缩小..
EXPLAIN ANALYZE
SELECT start_date::date AS date,
AVG(hv1.value::float) FILTER ( WHERE hv1.variable_id = 1 ) as value1,
AVG(hv2.value::float) FILTER ( WHERE hv1.variable_id = 2 ) as value2
FROM generate_series(
timestamp with time zone '2017-01-01',
timestamp with time zone '2017-12-01' - interval '1 day'
interval '1 day'
) AS start_date
LEFT JOIN history_values hv1
ON hv1.created_at BETWEEN start_date AND (start_date + interval '1 day' )
AND hv1.variable_id IN (1,2)
GROUP BY start_date
ORDER BY start_date
以后,请在 http://dba.stackexchange.com 上提出有关 PostgreSQL 的问题。我会把它标记为在那里迁移。管理员会很乐意移动它。
以下查询帮助我计算平均时间间隔内分布的历史值的平均值。
EXPLAIN ANALYZE SELECT start_date as date, AVG(hcv1.value::float) as value
FROM generate_series(cast('2017-01-01' as abstime), cast('2017-12-01' as abstime), interval '86400 seconds') start_date
LEFT JOIN history_values hv
ON (
hv.variable_id = 3 AND
hv.created_at BETWEEN start_date AND start_date + interval '86400 seconds'
)
GROUP BY start_date
ORDER BY start_date
这里是查询的报告:https://explain.depesz.com/s/q29a
现在,如果我尝试添加指向另一个 variable_id 的额外列 value2,查询时间将从 2 秒变为 150 秒:
EXPLAIN ANALYZE SELECT start_date as date,
AVG(hv1.value::float) as value1,
AVG(hv2.value::float) as value2
FROM generate_series(cast('2017-01-01' as abstime), cast('2017-12-01' as abstime), interval '86400 seconds') start_date
LEFT JOIN history_values hv1
ON (
hv1.variable_id = 2 AND
hv.created_at BETWEEN start_date AND start_date + interval '86400 seconds'
)
LEFT JOIN history_values hv2
ON (
hv2.variable_id = 3 AND
hv.created_at BETWEEN start_date AND start_date + interval '86400 seconds'
)
GROUP BY start_date
ORDER BY start_date
报告如下:https://explain.depesz.com/s/V1sV
谁能告诉我为什么?我真的希望时间在 4 秒左右,而不是差不多 75 倍。
另请注意:
SELECT COUNT(*) FROM history_values WHERE variable_id = 2 -- ~25k records
SELECT COUNT(*) FROM history_values WHERE variable_id = 3 -- ~25k records
您不是在添加额外的列,而是在添加另一个连接条件。而且你不需要额外的连接..
试试,只过滤 avg()
EXPLAIN ANALYZE
SELECT start_date as date,
AVG(hv1.value::float) FILTER ( WHERE hv1.variable_id = 1 ) as value1,
AVG(hv2.value::float) FILTER ( WHERE hv1.variable_id = 2 ) as value2
FROM generate_series(
cast('2017-01-01' as abstime)
, cast('2017-12-01' as abstime),
, interval '86400 seconds'
) AS start_date
LEFT JOIN history_values hv1
ON (
hv1.created_at >= cast('2017-01-01' as abstime) AND
hv1.created_at <= cast('2017-12-01' as abstime) AND
hv1.created_at >= start_date AND
hv1.created_at < start_date + interval '86400 seconds'
)
GROUP BY start_date
ORDER BY start_date
附带说明一下,您不应该使用 abstime
。那应该仅供内部使用。相反,我会使用
EXPLAIN ANALYZE
SELECT start_date::date AS date,
AVG(hv1.value::float) FILTER ( WHERE hv1.variable_id = 1 ) as value1,
AVG(hv2.value::float) FILTER ( WHERE hv1.variable_id = 2 ) as value2
FROM generate_series(
timestamp with time zone '2017-01-01',
timestamp with time zone '2017-12-01',
interval '1 day'
) AS start_date
LEFT JOIN history_values hv1
ON (
hv1.created_at BETWEEN (
timestamp with time zone '2017-01-01'
AND timestamp with time zone '2017-12-01'
) AND
hv1.created_at >= start_date AND
hv1.created_at < start_date + interval '1 day' AND
hv1.variable_id IN (1,2)
)
GROUP BY start_date
ORDER BY start_date
我还认为你可以将这些范围缩小..
EXPLAIN ANALYZE
SELECT start_date::date AS date,
AVG(hv1.value::float) FILTER ( WHERE hv1.variable_id = 1 ) as value1,
AVG(hv2.value::float) FILTER ( WHERE hv1.variable_id = 2 ) as value2
FROM generate_series(
timestamp with time zone '2017-01-01',
timestamp with time zone '2017-12-01' - interval '1 day'
interval '1 day'
) AS start_date
LEFT JOIN history_values hv1
ON hv1.created_at BETWEEN start_date AND (start_date + interval '1 day' )
AND hv1.variable_id IN (1,2)
GROUP BY start_date
ORDER BY start_date
以后,请在 http://dba.stackexchange.com 上提出有关 PostgreSQL 的问题。我会把它标记为在那里迁移。管理员会很乐意移动它。