Update using join on big table - performance tips?
I've been struggling with this update, which never finishes:
update votings v
set voter_id = (select pv.number from voters pv WHERE pv.person_id = v.person_id);
The table currently has 96M records:
select count(0) from votings;
count
----------
96575239
(1 registro)
The update is apparently using the index:
explain update votings v
set voter_id = (select pv.number from voters pv WHERE pv.rl_person_id = v.person_id);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Update on votings v (cost=0.00..788637465.40 rows=91339856 width=1671)
-> Seq Scan on votings v (cost=0.00..788637465.40 rows=91339856 width=1671)
SubPlan 1
-> Index Scan using idx_voter_rl_person_id on voters pv (cost=0.56..8.58 rows=1 width=9)
Index Cond: (rl_person_id = v.person_id)
(5 registros)
These are my indexes on votings:
Índices:
"votings_pkey" PRIMARY KEY, btree (id)
"votings_election_id_voter_id_key" UNIQUE CONSTRAINT, btree (election_id, person_id)
"votings_external_id_external_source_key" UNIQUE CONSTRAINT, btree (external_id, external_source)
"idx_votings_updated_at" btree (updated_at DESC)
"idx_votings_vote_party" btree (vote_party)
"idx_votings_vote_state_vote_party" btree (vote_state, vote_party)
"idx_votings_voter_id" btree (person_id)
Restrições de chave estrangeira:
"votings_election_id_fkey" FOREIGN KEY (election_id) REFERENCES elections(id)
"votings_voter_id_fkey" FOREIGN KEY (person_id) REFERENCES people_all(id)
Guys, does anyone have a clue what plays the biggest role in how long the update runs? The number of rows, or the join being used?
One suggestion I can make here is to use a covering index for the subquery lookup:
CREATE INDEX idx_cover ON voters (person_id, number);
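In the context of a select, this might not give much of an advantage over your current index on person_id, but in the context of an update it might matter more. The reason is that, for an update, this index might relieve Postgres of having to create and maintain a copy of the original table state from before the update.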
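On PostgreSQL 11 or later, the same covering index could also be written with an INCLUDE clause, which keeps number out of the search key and stores it only as payload. This is a sketch rather than part of the answer above: the index name idx_voters_person_id_incl is made up, the literal 1 assumes person_id is an integer, and whether the planner actually picks an index-only scan depends on the visibility map being reasonably current, hence the VACUUM.

```sql
-- Hypothetical index name; person_id is the search key, number is payload only.
CREATE INDEX idx_voters_person_id_incl ON voters (person_id) INCLUDE (number);

-- Index-only scans rely on the visibility map, which VACUUM maintains.
VACUUM (ANALYZE) voters;

-- Look for "Index Only Scan" on voters in the resulting plan.
-- The value 1 is a placeholder and assumes an integer person_id.
EXPLAIN
SELECT pv.number FROM voters pv WHERE pv.person_id = 1;
```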
If there really are 91339856 rows in votings, the 91339856 index scans on voters are certainly the dominant cost factor. A sequential scan will be faster.
You will probably improve performance if you don't force PostgreSQL into a nested loop join:
UPDATE votings
SET voter_id = voters.number
FROM voters
WHERE votings.person_id = voters.person_id;
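Before running the join form for real, it may be worth checking which plan it gets. The sketch below uses plain EXPLAIN, which only plans the statement and does not execute it. The extra IS DISTINCT FROM condition is an addition of mine, not part of the answer: it skips rows that already hold the right value. Note also that the join form leaves rows without a matching voter untouched, whereas the original subquery form sets their voter_id to NULL.

```sql
-- Plain EXPLAIN (no ANALYZE) shows the plan without performing the update.
-- A Hash Join or Merge Join here means the per-row index lookups are gone.
EXPLAIN
UPDATE votings
SET voter_id = voters.number
FROM voters
WHERE votings.person_id = voters.person_id
  -- Optional guard (an assumption, not in the answer): do not rewrite
  -- rows whose voter_id is already correct.
  AND votings.voter_id IS DISTINCT FROM voters.number;
```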
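Updating all the rows in a table is going to be very expensive. I would suggest re-creating the table:
-- Build the new table in one pass; vv.number is the value the question wants in voter_id
create table temp_votings as
select v.*, vv.number
from votings v join
voters vv
on vv.person_id = v.person_id;
For this query, you want an index on voters(person_id, number). I'm guessing that person_id might already be the primary key; if so, no additional index is needed.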
Then you can replace the existing table, but back it up first:
truncate table votings;
insert into votings ( . . . ) -- list columns here
select . . . -- and the same columns here
from temp_votings;
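The backup mentioned above could be taken along these lines, before the truncate is run. This is only a sketch: votings_backup is a made-up name, and CREATE TABLE ... AS copies the rows only, not the indexes, constraints, or foreign keys, so it serves as a data safety net rather than a drop-in replacement.

```sql
-- Hypothetical backup table; copies every row of votings, but no indexes or constraints.
CREATE TABLE votings_backup AS TABLE votings;

-- Sanity check before truncating: both counts should match.
SELECT (SELECT count(*) FROM votings)        AS original_rows,
       (SELECT count(*) FROM votings_backup) AS backup_rows;
```

In PostgreSQL, TRUNCATE is transactional, so wrapping the truncate and the insert in a single BEGIN ... COMMIT block means a failed insert rolls back to the full table.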