SQL Speed Up my Update - Left Join in PostgreSQL
I am trying to join one table to another using two shared IDs (patient_id, encounter_id). Both tables are indexed on these IDs.
This is the LHS:
tnx_prophy=# \d diagnosis
Table "public.diagnosis"
Column | Type | Collation | Nullable | Default
-------------------------------+------+-----------+----------+---------
patient_id | text | | |
encounter_id | text | | |
code_system | text | | |
code | text | | |
principal_diagnosis_indicator | text | | |
date | text | | |
Indexes:
"idx_pt_enc_dx" btree (patient_id, encounter_id)
This is the RHS:
tnx_prophy=# \d encounter
Table "public.encounter"
Column | Type | Collation | Nullable | Default
--------------+------+-----------+----------+---------
encounter_id | text | | |
patient_id | text | | |
type | text | | |
enc_type | text | | |
Indexes:
"idx_pt_enc_enc" btree (patient_id, encounter_id)
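The dataset is large (~500 million rows?), but my UPDATE with a JOIN seems to take far longer than I would like. And yes, I do want to update (not just generate a temporary table).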
tnx_prophy=# ALTER TABLE diagnosis ADD COLUMN enc_type text;
ALTER TABLE
tnx_prophy=# UPDATE diagnosis
tnx_prophy-# SET enc_type = encounter.enc_type
tnx_prophy-# FROM encounter
tnx_prophy-# WHERE (diagnosis.patient_id, diagnosis.encounter_id) = (encounter.patient_id, encounter.encounter_id);
Any suggestions on how to do this faster? Or have I messed up the syntax here? Many thanks if anyone can help!
\i tmp.sql
CREATE TABLE diagnosis
( patient_id text
, encounter_id text
-- , code_system text
-- , code text
, principal_diagnosis_indicator text
-- , date text
);
CREATE INDEX idx_pt_enc_dx ON diagnosis (patient_id, encounter_id);
CREATE TABLE encounter
( encounter_id text
, patient_id text
, type text
, enc_type text
);
CREATE INDEX idx_pt_enc_enc ON encounter (patient_id, encounter_id);
INSERT INTO diagnosis(patient_id, encounter_id, principal_diagnosis_indicator) VALUES
(1,1, 'influenza')
,(1,1, 'cancer')
,(2,1, 'influenza')
,(2,1, 'cancer')
;
INSERT INTO encounter(patient_id, encounter_id, enc_type) VALUES
( 1,1, 'OMG')
,( 1,1, 'WTF')
,( 2,1, 'WTF')
,( 2,1, 'OMG')
;
ALTER TABLE diagnosis ADD COLUMN enc_type text;
EXPLAIN ANALYZE
UPDATE diagnosis dst
SET enc_type = src.enc_type
FROM encounter src
WHERE (dst.patient_id, dst.encounter_id) = (src.patient_id, src.encounter_id)
AND dst.enc_type IS DISTINCT FROM src.enc_type -- both columns are NULLABLE
;
SELECT * FROM diagnosis;
Result:
DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
CREATE INDEX
CREATE TABLE
CREATE INDEX
INSERT 0 4
INSERT 0 4
ALTER TABLE
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Update on diagnosis dst (cost=0.30..47.59 rows=7 width=140) (actual time=0.383..0.385 rows=0 loops=1)
-> Merge Join (cost=0.30..47.59 rows=7 width=140) (actual time=0.139..0.232 rows=8 loops=1)
Merge Cond: ((dst.patient_id = src.patient_id) AND (dst.encounter_id = src.encounter_id))
Join Filter: (dst.enc_type IS DISTINCT FROM src.enc_type)
-> Index Scan using idx_pt_enc_dx on diagnosis dst (cost=0.15..21.15 rows=520 width=134) (actual time=0.066..0.082 rows=4 loops=1)
-> Index Scan using idx_pt_enc_enc on encounter src (cost=0.15..21.15 rows=520 width=102) (actual time=0.051..0.086 rows=7 loops=1)
Planning Time: 1.278 ms
Execution Time: 0.858 ms
(8 rows)
patient_id | encounter_id | principal_diagnosis_indicator | enc_type
------------+--------------+-------------------------------+----------
1 | 1 | cancer | WTF
1 | 1 | influenza | WTF
2 | 1 | cancer | OMG
2 | 1 | influenza | OMG
(4 rows)
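Take a good look at this line of the plan:
Merge Join (cost=0.30..47.59 rows=7 width=140) (actual time=0.139..0.232 rows=8 loops=1)
The join feeds 8 rows into the update, but there are only 4 records! This happens because the search key into your lookup table is not unique. Each diagnosis record matches two encounter rows, and which match ends up in enc_type is undefined.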
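The non-unique key described above can be checked for, and worked around, before running the big update. The following is a minimal sketch against the sample tables from this answer, not a definitive fix: the first query lists join keys that occur more than once in encounter, and the second updates from a de-duplicated subquery so that each diagnosis row matches at most one source row. The tie-break (keep the smallest enc_type per key) is purely an assumption for illustration; the right rule depends on what the data means.

```sql
-- Step 1: list (patient_id, encounter_id) keys that occur more than once
-- in the lookup table. An empty result means the key is already unique.
SELECT patient_id, encounter_id, count(*) AS n_rows
FROM encounter
GROUP BY patient_id, encounter_id
HAVING count(*) > 1;

-- Step 2: update from a de-duplicated source, so every diagnosis row
-- joins to at most one encounter row and the outcome is deterministic.
UPDATE diagnosis dst
SET enc_type = src.enc_type
FROM (
    SELECT DISTINCT ON (patient_id, encounter_id)
           patient_id, encounter_id, enc_type
    FROM encounter
    -- ASSUMPTION: the smallest enc_type per key wins; replace this
    -- ordering with whatever rule actually identifies the right row.
    ORDER BY patient_id, encounter_id, enc_type
) src
WHERE (dst.patient_id, dst.encounter_id) = (src.patient_id, src.encounter_id)
  AND dst.enc_type IS DISTINCT FROM src.enc_type;  -- skip rows that already hold the value
```

If the first query returns no rows on the real data, the subquery is unnecessary and a unique index on encounter (patient_id, encounter_id) would document that guarantee. With duplicates present, the join then produces one row per diagnosis record instead of two.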