pg_wal folder on standby node not removing files (postgresql-11)
I have set up master-slave (primary-standby) streaming replication across 2 physical nodes. Replication is working fine and both the walsender and walreceiver processes are healthy, but the files in the pg_wal folder on the slave node are not being removed. This is a problem I face every time I try to recover the slave node after a crash. Here are the details of the issue:
postgresql.conf on both the master and the slave/standby node
# Connection settings
# -------------------
listen_addresses = '*'
port = 5432
max_connections = 400
tcp_keepalives_idle = 0
tcp_keepalives_interval = 0
tcp_keepalives_count = 0
# Memory-related settings
# -----------------------
shared_buffers = 32GB # Physical memory 1/4
##DEBUG: mmap(1652555776) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
#huge_pages = try # on, off, or try
#temp_buffers = 16MB # depends on DB checklist
work_mem = 8MB # Need tuning
effective_cache_size = 64GB # Physical memory 1/2
maintenance_work_mem = 512MB
wal_buffers = 64MB
# WAL/Replication/HA settings
# --------------------
wal_level = logical
synchronous_commit = remote_write
archive_mode = on
archive_command = 'rsync -a %p /TPINFO01/wal_archive/%f'
#archive_command = ':'
max_wal_senders=5
hot_standby = on
restart_after_crash = off
wal_sender_timeout = 5000
wal_receiver_status_interval = 2
max_standby_streaming_delay = -1
max_standby_archive_delay = -1
hot_standby_feedback = on
random_page_cost = 1.5
max_wal_size = 5GB
min_wal_size = 200MB
checkpoint_completion_target = 0.9
checkpoint_timeout = 30min
# Logging settings
# ----------------
log_destination = 'csvlog,syslog'
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql_%Y%m%d.log'
log_truncate_on_rotation = off
log_rotation_age = 1h
log_rotation_size = 0
log_timezone = 'Japan'
log_line_prefix = '%t [%p]: [%l-1] %h:%u@%d:[PG]:CODE:%e '
log_statement = all
log_min_messages = info # DEBUG5
log_min_error_statement = info # DEBUG5
log_error_verbosity = default
log_checkpoints = on
log_lock_waits = on
log_temp_files = 0
log_connections = on
log_disconnections = on
log_duration = off
log_min_duration_statement = 1000
log_autovacuum_min_duration = 3000ms
track_functions = pl
track_activity_query_size = 8192
# Locale/display settings
# -----------------------
lc_messages = 'C'
lc_monetary = 'en_US.UTF-8' # ja_JP.eucJP
lc_numeric = 'en_US.UTF-8' # ja_JP.eucJP
lc_time = 'en_US.UTF-8' # ja_JP.eucJP
timezone = 'Asia/Tokyo'
bytea_output = 'escape'
# Auto vacuum settings
# -----------------------
autovacuum = on
autovacuum_max_workers = 3
autovacuum_vacuum_cost_limit = 200
auto_explain.log_min_duration = 10000
auto_explain.log_analyze = on
include '/var/lib/pgsql/tmp/rep_mode.conf' # added by pgsql RA
recovery.conf
primary_conninfo = 'host=xxx.xx.xx.xx port=5432 user=replica application_name=xxxxx keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = 'rsync -a /TPINFO01/wal_archive/%f %p'
recovery_target_timeline = 'latest'
standby_mode = 'on'
pg_stat_replication result on the master/primary
select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid | 8868
usesysid | 16420
usename | xxxxxxx
application_name | sub_xxxxxxx
client_addr | xx.xx.xxx.xxx
client_hostname |
client_port | 21110
backend_start | 2021-06-10 10:55:37.61795+09
backend_xmin |
state | streaming
sent_lsn | 97AC/589D93B8
write_lsn | 97AC/589D93B8
flush_lsn | 97AC/589D93B8
replay_lsn | 97AC/589D93B8
write_lag |
flush_lag |
replay_lag |
sync_priority | 0
sync_state | async
-[ RECORD 2 ]----+------------------------------
pid | 221533
usesysid | 3541624258
usename | replica
application_name | xxxxx
client_addr | xxx.xx.xx.xx
client_hostname |
client_port | 55338
backend_start | 2021-06-12 21:26:40.192443+09
backend_xmin | 72866358
state | streaming
sent_lsn | 97AC/589D93B8
write_lsn | 97AC/589D93B8
flush_lsn | 97AC/589D93B8
replay_lsn | 97AC/589D93B8
write_lag |
flush_lag |
replay_lag |
sync_priority | 1
sync_state | sync
Steps I followed to recover the standby node from a crash
- On the master: select pg_start_backup('backup');
- rsync the data folder and the wal_archive folder from master/primary to slave/standby
- On the master: select pg_stop_backup();
- Restart postgres on the slave/standby node.
This brings the slave/standby node back in sync with the master, and it has been running fine ever since.
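The recovery steps above could be scripted roughly as follows. This is only a sketch: the host name, data directory, and user are placeholders, not values from the question, and it assumes passwordless ssh/rsync between the nodes.

```shell
#!/bin/sh
# Sketch of the crash-recovery procedure (placeholders: PRIMARY, PGDATA).
PRIMARY=primary-host
PGDATA=/var/lib/pgsql/11/data

# 1. Put the primary into backup mode
psql -h "$PRIMARY" -U postgres -c "SELECT pg_start_backup('backup');"

# 2. Copy the data directory and the WAL archive to the standby.
#    The docs recommend excluding pg_replslot so the primary's
#    replication slots are not cloned onto the standby.
rsync -a --delete --exclude=pg_replslot/ "$PRIMARY:$PGDATA/" "$PGDATA/"
rsync -a "$PRIMARY:/TPINFO01/wal_archive/" /TPINFO01/wal_archive/

# 3. End backup mode on the primary
psql -h "$PRIMARY" -U postgres -c "SELECT pg_stop_backup();"

# 4. Start PostgreSQL on the standby
pg_ctl -D "$PGDATA" start
```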
On the primary/master node, files in the pg_wal folder get removed after almost 2 hours. But the files on the slave/standby node are not removed. Almost all of the files have a corresponding <filename>.done file in the archive_status folder inside pg_wal, on the standby node as well.
I suppose the problem would go away if I performed a switchover, but I would still like to understand why it is happening.
Please note, I am also trying to find answers to the following questions:
- Which process writes the files to pg_wal on the slave/standby node? I am following this link: https://severalnines.com/database-blog/postgresql-streaming-replication-deep-dive
- Which parameter removes files from the pg_wal folder on the standby node?
- Do they need to go to the wal_archive folder on disk just like they go to the wal_archive folder on the master node?
You don't describe omitting pg_replslot during your rsync, as the docs recommend. If you didn't omit it, then your replica now has a replication slot that is a clone of the one on the primary. But if nothing ever connects to that slot on the replica and advances its cutoff, the WAL is never released for recycling. To fix it, you just need to shut down the replica, remove that directory, and start it again (then wait for the next restartpoint to complete).
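Before deleting anything, you can confirm the leftover slot on the replica itself. A sketch (the slot name will be whatever was cloned over from the primary):

```sql
-- Run on the replica. An inactive slot with an old restart_lsn is what
-- pins the WAL files in pg_wal and prevents them from being recycled.
SELECT slot_name, slot_type, active, restart_lsn
FROM pg_replication_slots;
```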
Do they need to go to wal_archive folder on the disk just like they go to wal_archive folder on the master node?
No, that is optional, not necessary. If you want it to happen, it is controlled by the archive_mode = always setting.
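If you did want the standby to keep its own wal_archive copy, the relevant standby-side settings would look roughly like this (a sketch reusing the archive_command from the question's postgresql.conf):

```
# On the standby: 'always' archives WAL received from the primary too;
# plain 'on' only archives on a server that is not in recovery.
archive_mode = always
archive_command = 'rsync -a %p /TPINFO01/wal_archive/%f'
```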