我怎样才能使这个查询 运行 在 postgres 中更快
How can I make this query run faster in postgres
我有这个查询需要 86 秒才能执行。
select cust_id customer_id,
cust_first_name customer_first_name,
cust_last_name customer_last_name,
cust_prf customer_prf,
cust_birth_country customer_birth_country,
cust_login customer_login,
cust_email_address customer_email_address,
date_year ddyear,
sum(((stock_ls_price-stock_ws_price-stock_ds_price)+stock_es_price)/2) total_yr,
's' stock_type
from customer, stock, date
where customer_k = stock_customer_k
and stock_soldate_k = date_k
group by cust_id, cust_first_name, cust_last_name, cust_prf, cust_birth_country, cust_login, cust_email_address, date_year;
解释分析结果:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=639753.55..764040.06 rows=2616558 width=213) (actual time=81192.575..86536.398 rows=190581 loops=1)
Group Key: customer.cust_id, customer.cust_first_name, customer.cust_last_name, customer.cust_prf, customer.cust_birth_country, customer.cust_login, customer.cust_email_address, date.date_year
-> Sort (cost=639753.55..646294.95 rows=2616558 width=213) (actual time=81192.468..83977.960 rows=2685453 loops=1)
Sort Key: customer.cust_id, customer.cust_first_name, customer.cust_last_name, customer.cust_prf, customer.cust_birth_country, customer.cust_login, customer.cust_email_address, date.date_year
Sort Method: external merge Disk: 460920kB
-> Hash Join (cost=6527.66..203691.58 rows=2616558 width=213) (actual time=60.500..2306.082 rows=2685453 loops=1)
Hash Cond: (stock.stock_customer_k = customer.customer_k)
-> Merge Join (cost=1423.66..144975.59 rows=2744641 width=30) (actual time=8.820..1412.109 rows=2750311 loops=1)
Merge Cond: (date.date_k = stock.stock_soldate_k)
-> Index Scan using date_key_idx on date (cost=0.29..2723.33 rows=73049 width=8) (actual time=0.013..7.164 rows=37622 loops=1)
-> Index Scan using stock_soldate_k_index on stock (cost=0.43..108829.12 rows=2880404 width=30) (actual time=0.004..735.043 rows=2750312 loops=1)
-> Hash (cost=3854.00..3854.00 rows=100000 width=191) (actual time=51.650..51.650rows=100000 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 16139kB
-> Seq Scan on customer (cost=0.00..3854.00 rows=100000 width=191) (actual time=0.004..30.341 rows=100000 loops=1)
Planning time: 1.761 ms
Execution time: 86621.807 ms
我有work_mem=512MB
。我创建了索引
cust_id
、customer_k
、stock_customer_k
、stock_soldate_k
和 date_k
。
customer
中有大约 100,000 行,stock
中有 3,000,000 行,date
中有 80,000 行。
如何使此查询 运行 更快?
如果有任何帮助,我将不胜感激!
TABLE 定义
日期
Column | Type | Modifiers
---------------------+---------------+-----------
date_k | integer | not null
date_id | character(16) | not null
date_date | date |
date_year | integer |
现货
Column | Type | Modifiers
-----------------------+--------------+-----------
stock_soldate_k | integer |
stock_soltime_k | integer |
stock_customer_k | integer |
stock_ds_price | numeric(7,2) |
stock_es_price | numeric(7,2) |
stock_ls_price | numeric(7,2) |
stock_ws_price | numeric(7,2) |
顾客:
Column | Type | Modifiers
---------------------------+-----------------------+-----------
customer_k | integer | not null
customer_id | character(16) | not null
cust_first_name | character(20) |
cust_last_name | character(30) |
cust_prf | character(1) |
cust_birth_country | character varying(20) |
cust_login | character(13) |
cust_email_address | character(50) |
TABLE "stock" CONSTRAINT "st1" FOREIGN KEY (stock_soldate_k) REFERENCES date(date_k)
"st2" FOREIGN KEY (stock_customer_k) REFERENCES customer(customer_k)
你得到的一个很大的性能损失是大约 450MB 的中间数据存储在外部:Sort Method: external merge Disk: 460920kB
。发生这种情况是因为规划器首先需要满足 3 table 之间的连接条件,包括可能效率低下的 table customer
,然后聚合 sum()
才能发生,即使而聚合可以单独在 table stock
上完美执行。
查询
因为您的 table 相当大,您最好尽快减少符合条件的行数,最好是在任何加入之前。在这种情况下,这意味着在子查询中对 table stock
进行聚合并将该结果连接到其他两个 tables:
SELECT c.cust_id AS customer_id,
c.cust_first_name AS customer_first_name,
c.cust_last_name AS customer_last_name,
c.cust_prf AS customer_prf,
c.cust_birth_country AS customer_birth_country,
c.cust_login AS customer_login,
c.cust_email_address AS customer_email_address,
d.date_year AS ddyear,
ss.total_yr,
's' stock_type
FROM (
SELECT
stock_customer_k AS ck,
stock_soldate_k AS sdk,
sum((stock_ls_price-stock_ws_price-stock_ds_price+stock_es_price)*0.5) AS total_yr
FROM stock
GROUP BY 1, 2) ss
JOIN customer c ON c.customer_k = ss.ck
JOIN date d ON d.date_k = ss.sdk;
stock
上的子查询将产生更少的行,具体取决于每个客户每个日期的平均订单数。此外,在 sum()
函数中,乘以 0.5 比除以 2 便宜得多(尽管在宏伟的计划中它将相对微不足道)。
数据模型
你也应该认真看看你的数据模型。
在 table customer
中,您使用像 char(30)
这样的数据类型,即使您只存储 'X',它也总是会在行中占用 30 个字节。当许多字符串比声明的最大宽度短时,使用 varchar(30)
数据类型效率更高,因为它占用更少 space,因此需要更少的页面读取(和写入中间数据)。
Table stock
使用 numeric(7,2)
作为价格。使用 numeric
数据类型可能会在对数据进行多次重复操作时给出准确的结果,但它们也非常慢。 double precision
数据类型在您的场景中会更快并且同样准确。出于演示目的,您可以将值四舍五入到所需的精度。
作为建议,使用 double precision
数据类型而不是 numeric
创建 table stock_f
,将所有数据从 stock
复制到 stock_f
和 运行 关于 table.
的查询
试试这个:
with stock_grouped as
(select stock_customer_k, date_year, sum(((stock_ls_price-stock_ws_price-stock_ds_price)+stock_es_price)/2) total_yr
from stock, date
where stock_soldate_k = date_k
group by stock_customer_k, date_year)
select cust_id customer_id,
cust_first_name customer_first_name,
cust_last_name customer_last_name,
cust_prf customer_prf,
cust_birth_country customer_birth_country,
cust_login customer_login,
cust_email_address customer_email_address,
date_year ddyear,
total_yr,
's' stock_type
from customer, stock_grouped
where customer_k = stock_customer_k
此查询预期通过连接进行分组。
我有这个查询需要 86 秒才能执行。
select cust_id customer_id,
cust_first_name customer_first_name,
cust_last_name customer_last_name,
cust_prf customer_prf,
cust_birth_country customer_birth_country,
cust_login customer_login,
cust_email_address customer_email_address,
date_year ddyear,
sum(((stock_ls_price-stock_ws_price-stock_ds_price)+stock_es_price)/2) total_yr,
's' stock_type
from customer, stock, date
where customer_k = stock_customer_k
and stock_soldate_k = date_k
group by cust_id, cust_first_name, cust_last_name, cust_prf, cust_birth_country, cust_login, cust_email_address, date_year;
解释分析结果:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=639753.55..764040.06 rows=2616558 width=213) (actual time=81192.575..86536.398 rows=190581 loops=1)
Group Key: customer.cust_id, customer.cust_first_name, customer.cust_last_name, customer.cust_prf, customer.cust_birth_country, customer.cust_login, customer.cust_email_address, date.date_year
-> Sort (cost=639753.55..646294.95 rows=2616558 width=213) (actual time=81192.468..83977.960 rows=2685453 loops=1)
Sort Key: customer.cust_id, customer.cust_first_name, customer.cust_last_name, customer.cust_prf, customer.cust_birth_country, customer.cust_login, customer.cust_email_address, date.date_year
Sort Method: external merge Disk: 460920kB
-> Hash Join (cost=6527.66..203691.58 rows=2616558 width=213) (actual time=60.500..2306.082 rows=2685453 loops=1)
Hash Cond: (stock.stock_customer_k = customer.customer_k)
-> Merge Join (cost=1423.66..144975.59 rows=2744641 width=30) (actual time=8.820..1412.109 rows=2750311 loops=1)
Merge Cond: (date.date_k = stock.stock_soldate_k)
-> Index Scan using date_key_idx on date (cost=0.29..2723.33 rows=73049 width=8) (actual time=0.013..7.164 rows=37622 loops=1)
-> Index Scan using stock_soldate_k_index on stock (cost=0.43..108829.12 rows=2880404 width=30) (actual time=0.004..735.043 rows=2750312 loops=1)
-> Hash (cost=3854.00..3854.00 rows=100000 width=191) (actual time=51.650..51.650rows=100000 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 16139kB
-> Seq Scan on customer (cost=0.00..3854.00 rows=100000 width=191) (actual time=0.004..30.341 rows=100000 loops=1)
Planning time: 1.761 ms
Execution time: 86621.807 ms
我有work_mem=512MB
。我创建了索引
cust_id
、customer_k
、stock_customer_k
、stock_soldate_k
和 date_k
。
customer
中有大约 100,000 行,stock
中有 3,000,000 行,date
中有 80,000 行。
如何使此查询 运行 更快? 如果有任何帮助,我将不胜感激!
TABLE 定义
日期
Column | Type | Modifiers
---------------------+---------------+-----------
date_k | integer | not null
date_id | character(16) | not null
date_date | date |
date_year | integer |
现货
Column | Type | Modifiers
-----------------------+--------------+-----------
stock_soldate_k | integer |
stock_soltime_k | integer |
stock_customer_k | integer |
stock_ds_price | numeric(7,2) |
stock_es_price | numeric(7,2) |
stock_ls_price | numeric(7,2) |
stock_ws_price | numeric(7,2) |
顾客:
Column | Type | Modifiers
---------------------------+-----------------------+-----------
customer_k | integer | not null
customer_id | character(16) | not null
cust_first_name | character(20) |
cust_last_name | character(30) |
cust_prf | character(1) |
cust_birth_country | character varying(20) |
cust_login | character(13) |
cust_email_address | character(50) |
TABLE "stock" CONSTRAINT "st1" FOREIGN KEY (stock_soldate_k) REFERENCES date(date_k)
"st2" FOREIGN KEY (stock_customer_k) REFERENCES customer(customer_k)
你得到的一个很大的性能损失是大约 450MB 的中间数据存储在外部:Sort Method: external merge Disk: 460920kB
。发生这种情况是因为规划器首先需要满足 3 table 之间的连接条件,包括可能效率低下的 table customer
,然后聚合 sum()
才能发生,即使而聚合可以单独在 table stock
上完美执行。
查询
因为您的 table 相当大,您最好尽快减少符合条件的行数,最好是在任何加入之前。在这种情况下,这意味着在子查询中对 table stock
进行聚合并将该结果连接到其他两个 tables:
SELECT c.cust_id AS customer_id,
c.cust_first_name AS customer_first_name,
c.cust_last_name AS customer_last_name,
c.cust_prf AS customer_prf,
c.cust_birth_country AS customer_birth_country,
c.cust_login AS customer_login,
c.cust_email_address AS customer_email_address,
d.date_year AS ddyear,
ss.total_yr,
's' stock_type
FROM (
SELECT
stock_customer_k AS ck,
stock_soldate_k AS sdk,
sum((stock_ls_price-stock_ws_price-stock_ds_price+stock_es_price)*0.5) AS total_yr
FROM stock
GROUP BY 1, 2) ss
JOIN customer c ON c.customer_k = ss.ck
JOIN date d ON d.date_k = ss.sdk;
stock
上的子查询将产生更少的行,具体取决于每个客户每个日期的平均订单数。此外,在 sum()
函数中,乘以 0.5 比除以 2 便宜得多(尽管在宏伟的计划中它将相对微不足道)。
数据模型
你也应该认真看看你的数据模型。
在 table customer
中,您使用像 char(30)
这样的数据类型,即使您只存储 'X',它也总是会在行中占用 30 个字节。当许多字符串比声明的最大宽度短时,使用 varchar(30)
数据类型效率更高,因为它占用更少 space,因此需要更少的页面读取(和写入中间数据)。
Table stock
使用 numeric(7,2)
作为价格。使用 numeric
数据类型可能会在对数据进行多次重复操作时给出准确的结果,但它们也非常慢。 double precision
数据类型在您的场景中会更快并且同样准确。出于演示目的,您可以将值四舍五入到所需的精度。
作为建议,使用 double precision
数据类型而不是 numeric
创建 table stock_f
,将所有数据从 stock
复制到 stock_f
和 运行 关于 table.
试试这个:
with stock_grouped as
(select stock_customer_k, date_year, sum(((stock_ls_price-stock_ws_price-stock_ds_price)+stock_es_price)/2) total_yr
from stock, date
where stock_soldate_k = date_k
group by stock_customer_k, date_year)
select cust_id customer_id,
cust_first_name customer_first_name,
cust_last_name customer_last_name,
cust_prf customer_prf,
cust_birth_country customer_birth_country,
cust_login customer_login,
cust_email_address customer_email_address,
date_year ddyear,
total_yr,
's' stock_type
from customer, stock_grouped
where customer_k = stock_customer_k
此查询预期通过连接进行分组。