PostgreSQL - LEFT JOIN LATERAL is much slower than a subquery
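I am having some trouble using 'LEFT JOIN LATERAL' in PostgreSQL 9.5.
My table has three columns: 'ID', 'DATE', and 'CODE'. Each person (ID) has multiple rows, as shown below. There are 362 distinct IDs and about 2,500,000 rows in total.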
ID / DATE / CODE
1 / 20020101 / drugA
1 / 20020102 / drugA
1 / 20020103 / drugB
1 / 20020104 / drugA
1 / 20020105 / drugA
1 / 20020106 / drugB
1 / 20020107 / drugA
2 / ... / ...
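I need to summarize the drugB records that fall between the first and the last day of drugA for each ID.
In the example above, ID 1 only needs to keep two rows [those between 20020103 and 20020106, the drugB period]: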
1 / 20020104 / drugA
1 / 20020105 / drugA
To do this, I wrote the following SQL using 'LEFT JOIN LATERAL':
SELECT * FROM (SELECT ID, min(DATE) as start_date, max(DATE) as end_date from MAIN_TABLE WHERE CODE = 'drugA' GROUP BY ID) AA
LEFT JOIN LATERAL (SELECT ID, COUNT(ID) as no_tx, min(DATE) as fday_tx, max(DATE) lday_tx from MAIN_TABLE WHERE CODE = 'drugB' AND DATE > AA.start_date AND DATE < AA.end_date GROUP BY ID) as BB USING(ID);
There are only 362 IDs, but this query takes about 2 minutes.
That is too slow, so I tried a different query that uses a subquery instead.
SELECT * FROM (SELECT ID, min(DATE) as start_date, max(DATE) as end_date from MAIN_TABLE WHERE CODE ='drugA' GROUP BY ID) AA
LEFT JOIN (
SELECT ID, COUNT(ID) as no_tx, min(DATE) as fday_tx, max(DATE) lday_tx FROM (SELECT ID, DATE, CODE FROM MAIN_TABLE) BB
LEFT JOIN (SELECT ID, min(DATE) as start_date, max(DATE) as end_date from MAIN_TABLE WHERE CODE ='drugA' GROUP BY ID) CC USING (ID)
WHERE CODE = 'drugB' and DATE > start_date and DATE < end_date GROUP BY ID
) DD USING (ID);
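This query is more complicated, but it is very fast (only 1.6 seconds).
When I compare the EXPLAIN output of the two queries, the second one uses a hash join but the first one does not.
Could I get some hints on how to make the first query, the one using 'LEFT JOIN LATERAL', more efficient?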
Why not just use join and group by?
SELECT AA.ID, COUNT(b.ID) as no_tx, min(b.DATE) as fday_tx, max(b.DATE) as lday_tx,
       AA.start_date, AA.end_date
FROM (SELECT ID, min(DATE) as start_date, max(DATE) as end_date
      FROM MAIN_TABLE
      WHERE CODE = 'drugA'
      GROUP BY ID
     ) AA LEFT JOIN
     MAIN_TABLE b
     ON b.ID = AA.ID AND b.CODE = 'drugB' AND
        b.DATE > AA.start_date AND b.DATE < AA.end_date
GROUP BY AA.ID, AA.start_date, AA.end_date;
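To sanity-check the join-and-group-by approach on the sample rows from the question, here is a minimal sketch that runs it against an in-memory SQLite database through Python's sqlite3 module. Several things here are assumptions rather than part of the original post: SQLite stands in for PostgreSQL, dates are stored as integers, ID 2 is a made-up patient with no drugB rows, and the join is assumed to match on ID as well as on the date range (without that predicate, ID 2 would pick up ID 1's drugB rows).

```python
import sqlite3

# Sample rows modeled on the question. ID 2 is hypothetical: drugA only.
ROWS = [
    (1, 20020101, "drugA"), (1, 20020102, "drugA"), (1, 20020103, "drugB"),
    (1, 20020104, "drugA"), (1, 20020105, "drugA"), (1, 20020106, "drugB"),
    (1, 20020107, "drugA"),
    (2, 20020101, "drugA"), (2, 20020110, "drugA"),
]

# Join each ID's drugA window to that same ID's drugB rows, then aggregate.
QUERY = """
SELECT aa.id, COUNT(b.id) AS no_tx, MIN(b.date) AS fday_tx, MAX(b.date) AS lday_tx,
       aa.start_date, aa.end_date
FROM (SELECT id, MIN(date) AS start_date, MAX(date) AS end_date
      FROM main_table
      WHERE code = 'drugA'
      GROUP BY id
     ) aa LEFT JOIN
     main_table b
     ON b.id = aa.id AND b.code = 'drugB' AND
        b.date > aa.start_date AND b.date < aa.end_date
GROUP BY aa.id, aa.start_date, aa.end_date
ORDER BY aa.id
"""


def summarize(rows):
    """Load rows into a throwaway table and return one summary tuple per ID."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE main_table (id INTEGER, date INTEGER, code TEXT)")
    conn.executemany("INSERT INTO main_table VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in summarize(ROWS):
    print(row)
```

Because of the LEFT JOIN, an ID with no drugB rows inside its drugA window still appears, with a count of 0 and NULL (None in Python) for the first and last drugB dates.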
Or, perhaps more efficiently, with window functions:
SELECT ID, SUM(CASE WHEN code = 'drugB' THEN 1 ELSE 0 END) as no_tx,
       MIN(CASE WHEN code = 'drugB' THEN DATE END) as fday_tx,
       MAX(CASE WHEN code = 'drugB' THEN DATE END) as lday_tx,
       start_date, end_date
FROM (SELECT t.*,
             MIN(CASE WHEN code = 'drugA' THEN date END) OVER (PARTITION BY id) as start_date,
             MAX(CASE WHEN code = 'drugA' THEN date END) OVER (PARTITION BY id) as end_date
      FROM MAIN_TABLE t
     ) t
WHERE code in ('drugA', 'drugB') AND
      date between start_date and end_date
GROUP BY t.id, start_date, end_date;
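The window-function idea can be tried the same way. The sketch below is only an illustration under stated assumptions: it uses an in-memory SQLite database (window functions need SQLite 3.25 or later) instead of PostgreSQL, stores dates as integers, adds a hypothetical ID 2 with no drugB rows, and assumes the window aggregates are meant to be partitioned by ID with both start_date and end_date taken from the drugA rows.

```python
import sqlite3

# Sample rows modeled on the question. ID 2 is hypothetical: drugA only.
ROWS = [
    (1, 20020101, "drugA"), (1, 20020102, "drugA"), (1, 20020103, "drugB"),
    (1, 20020104, "drugA"), (1, 20020105, "drugA"), (1, 20020106, "drugB"),
    (1, 20020107, "drugA"),
    (2, 20020101, "drugA"), (2, 20020110, "drugA"),
]

# The inner query tags every row with its ID's first and last drugA date;
# the outer query keeps rows inside that range and aggregates drugB rows.
WINDOW_QUERY = """
SELECT id,
       SUM(CASE WHEN code = 'drugB' THEN 1 ELSE 0 END) AS no_tx,
       MIN(CASE WHEN code = 'drugB' THEN date END) AS fday_tx,
       MAX(CASE WHEN code = 'drugB' THEN date END) AS lday_tx,
       start_date, end_date
FROM (SELECT t.*,
             MIN(CASE WHEN code = 'drugA' THEN date END)
                 OVER (PARTITION BY id) AS start_date,
             MAX(CASE WHEN code = 'drugA' THEN date END)
                 OVER (PARTITION BY id) AS end_date
      FROM main_table t
     ) t
WHERE code IN ('drugA', 'drugB') AND
      date BETWEEN start_date AND end_date
GROUP BY id, start_date, end_date
ORDER BY id
"""


def summarize_window(rows):
    """Load rows into a throwaway table and run the single-pass summary."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE main_table (id INTEGER, date INTEGER, code TEXT)")
    conn.executemany("INSERT INTO main_table VALUES (?, ?, ?)", rows)
    result = conn.execute(WINDOW_QUERY).fetchall()
    conn.close()
    return result


for row in summarize_window(ROWS):
    print(row)
```

This version reads the table once instead of joining it to itself. Note that BETWEEN is inclusive, so a drugB row dated exactly on the first or last drugA day would be counted here, whereas the strict comparisons in the question's queries would exclude it.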