pg_xlog files not recycling on slave

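I have set up streaming replication with Postgres 9.3.
My problem is that on the slave the pg_xlog folder keeps filling up, and the WAL files are not being recycled.

The slave has the following (relevant) values in postgresql.conf: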

wal_keep_segments = 150
hot_standby = on
checkpoint_segments = 32
checkpoint_completion_target = 0.9
archive_mode = off
#archive_command = ''
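
For a rough idea of how large pg_xlog should get with these settings, the PostgreSQL 9.3 documentation gives a rule of thumb: the number of segment files normally stays at or below the larger of (2 + checkpoint_completion_target) * checkpoint_segments + 1 and checkpoint_segments + wal_keep_segments + 1. A minimal Python sketch of that arithmetic, assuming the default 16 MB segment size (the function name is made up for illustration):

```python
import math

# Assumption: default WAL segment size of 16 MB (not stated in the question).
SEGMENT_MB = 16


def expected_max_segments(checkpoint_segments, checkpoint_completion_target,
                          wal_keep_segments):
    """Rule-of-thumb ceiling on the number of files in pg_xlog (9.3 docs)."""
    by_checkpoints = (2 + checkpoint_completion_target) * checkpoint_segments + 1
    by_keep = checkpoint_segments + wal_keep_segments + 1
    return math.ceil(max(by_checkpoints, by_keep))


# Values taken from the postgresql.conf excerpt above.
segments = expected_max_segments(32, 0.9, 150)
print(segments, "segments, about", segments * SEGMENT_MB, "MB")
```

With the values above this comes to 183 segments, a little under 3 GB. A pg_xlog directory that grows well past that figure would suggest that files are not being recycled, rather than that wal_keep_segments is simply set high.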

My initial replication command was:

pg_basebackup  --xlog-method=stream -h <master-ip> -D .  --username=replication --password

So I think my WAL files are fine. Here is my slave startup log:

2017-05-08 09:55:31 IDT LOG:  database system was shut down in recovery at 2017-05-08 09:55:19 IDT
2017-05-08 09:55:31 IDT LOG:  entering standby mode
2017-05-08 09:55:31 IDT LOG:  redo starts at 361/C76DD3E8
2017-05-08 09:55:31 IDT LOG:  consistent recovery state reached at 361/C89A8278
2017-05-08 09:55:31 IDT LOG:  database system is ready to accept read only connections
2017-05-08 09:55:31 IDT LOG:  record with zero length at 361/C89A8278
2017-05-08 09:55:31 IDT LOG:  started streaming WAL from primary at 361/C8000000 on timeline 1
2017-05-08 09:55:32 IDT LOG:  incomplete startup packet
2017-05-08 09:58:34 IDT LOG:  received SIGHUP, reloading configuration files
2017-05-08 09:58:34 IDT LOG:  parameter "checkpoint_completion_target" changed to "0.9"

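I even tried manually copying old WAL files from the master to the slave, but that didn't help either.
What am I doing wrong? How do I stop the pg_xlog folder from growing indefinitely?
Is it related to the "incomplete startup packet" log message?

One last thing: under the pg_xlog\archive_status folder, all WAL files have a .done suffix.

I'd appreciate any help I can get.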

Edit:

I enabled log_checkpoints in postgresql.conf.
Here are the relevant log entries from after I enabled it:

2017-05-12 08:43:11 IDT LOG:  parameter "log_checkpoints" changed to "on"
2017-05-12 08:43:24 IDT LOG:  checkpoint complete: wrote 2128 buffers (0.9%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=189.240 s, sync=0.167 s, total=189.549 s; sync files=745, longest=0.010 s, average=0.000 s
2017-05-12 08:45:15 IDT LOG:  checkpoint starting: time
2017-05-12 08:48:46 IDT LOG:  checkpoint complete: wrote 15175 buffers (6.6%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=209.078 s, sync=1.454 s, total=210.617 s; sync files=769, longest=0.032 s, average=0.001 s
2017-05-12 08:50:15 IDT LOG:  checkpoint starting: time
2017-05-12 08:53:45 IDT LOG:  checkpoint complete: wrote 2480 buffers (1.1%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=209.162 s, sync=0.991 s, total=210.253 s; sync files=663, longest=0.076 s, average=0.001 s
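
To watch whether segments are being removed or recycled over time, the counters can be pulled out of these log lines. A small Python sketch, assuming the log format shown above; it also accepts "restartpoint complete" lines on the assumption that a standby logs them in the same shape. The function name is illustrative only.

```python
import re

# Captures the added/removed/recycled counters from a completion line.
PATTERN = re.compile(
    r"(?P<kind>checkpoint|restartpoint) complete: .*?"
    r"(?P<added>\d+) transaction log file\(s\) added, "
    r"(?P<removed>\d+) removed, (?P<recycled>\d+) recycled"
)


def wal_counters(line):
    """Return the WAL file counters from one log line, or None if absent."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "kind": match.group("kind"),
        "added": int(match.group("added")),
        "removed": int(match.group("removed")),
        "recycled": int(match.group("recycled")),
    }


# Sample line copied (shortened) from the log above.
SAMPLE = ("2017-05-12 08:48:46 IDT LOG:  checkpoint complete: wrote 15175 "
          "buffers (6.6%); 0 transaction log file(s) added, 0 removed, "
          "1 recycled; write=209.078 s, sync=1.454 s, total=210.617 s")
print(wal_counters(SAMPLE))
```

Running this over the slave's log and finding no "restartpoint complete" entries at all would be a more direct sign of the problem than the size of pg_xlog alone.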

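Edit 2:

Since my slave has no restartpoints in its log, here is the relevant log of the slave starting up and restoring WAL files before reaching a consistent recovery state: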

2017-05-12 09:35:42 IDT LOG:  database system was shut down in recovery at 2017-05-12 09:35:41 IDT
2017-05-12 09:35:42 IDT LOG:  entering standby mode
2017-05-12 09:35:42 IDT LOG:  incomplete startup packet
2017-05-12 09:35:43 IDT FATAL:  the database system is starting up
2017-05-12 09:35:43 IDT LOG:  restored log file "0000000100000369000000B1" from archive
2017-05-12 09:35:43 IDT FATAL:  the database system is starting up
2017-05-12 09:35:44 IDT FATAL:  the database system is starting up
2017-05-12 09:35:44 IDT LOG:  restored log file "0000000100000369000000AF" from archive
2017-05-12 09:35:44 IDT LOG:  redo starts at 369/AFD28900
2017-05-12 09:35:44 IDT FATAL:  the database system is starting up
2017-05-12 09:35:45 IDT FATAL:  the database system is starting up
2017-05-12 09:35:45 IDT FATAL:  the database system is starting up
2017-05-12 09:35:46 IDT LOG:  restored log file "0000000100000369000000B0" from archive
2017-05-12 09:35:46 IDT FATAL:  the database system is starting up
2017-05-12 09:35:46 IDT FATAL:  the database system is starting up
2017-05-12 09:35:47 IDT FATAL:  the database system is starting up
2017-05-12 09:35:47 IDT LOG:  restored log file "0000000100000369000000B1" from archive
2017-05-12 09:35:47 IDT FATAL:  the database system is starting up
2017-05-12 09:35:48 IDT FATAL:  the database system is starting up
2017-05-12 09:35:48 IDT LOG:  incomplete startup packet
2017-05-12 09:35:49 IDT LOG:  restored log file "0000000100000369000000B2" from archive
2017-05-12 09:35:50 IDT LOG:  restored log file "0000000100000369000000B3" from archive
2017-05-12 09:35:52 IDT LOG:  restored log file "0000000100000369000000B4" from archive   
.
.
.
2017-05-12 09:42:33 IDT LOG:  restored log file "000000010000036A000000C0" from archive
2017-05-12 09:42:35 IDT LOG:  restored log file "000000010000036A000000C1" from archive
2017-05-12 09:42:36 IDT LOG:  restored log file "000000010000036A000000C2" from archive
2017-05-12 09:42:37 IDT LOG:  restored log file "000000010000036A000000C3" from archive
2017-05-12 09:42:37 IDT LOG:  consistent recovery state reached at 36A/C3ACEB28
2017-05-12 09:42:37 IDT LOG:  database system is ready to accept read only connections
2017-05-12 09:42:39 IDT LOG:  restored log file "000000010000036A000000C4" from archive
2017-05-12 09:42:40 IDT LOG:  restored log file "000000010000036A000000C5" from archive
2017-05-12 09:42:42 IDT LOG:  restored log file "000000010000036A000000C6" from archive
ERROR: WAL file '000000010000036A000000C7' not found in server 'main-db-server'
2017-05-12 09:42:42 IDT LOG:  started streaming WAL from primary at 36A/C6000000 on timeline 1
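
The segment names in this log line up with the positions it reports: file 000000010000036A000000C6 is the segment that begins at 36A/C6000000, which is where streaming starts. A minimal Python sketch of that mapping, assuming the default 16 MB segment size and the 24-hex-digit naming (timeline, log id, segment number) seen above; the function name is hypothetical.

```python
# Assumption: default 16 MB WAL segments.
SEGMENT_BYTES = 16 * 1024 * 1024


def decode_wal_name(name):
    """Split a WAL segment file name into (timeline, start position)."""
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL segment name")
    timeline = int(name[0:8], 16)   # first 8 hex digits: timeline id
    log_id = int(name[8:16], 16)    # middle 8 hex digits: high part of the position
    segment = int(name[16:24], 16)  # last 8 hex digits: segment within that log id
    offset = segment * SEGMENT_BYTES
    # Same "%X/%X" shape as the positions printed in the server log.
    return timeline, "%X/%X" % (log_id, offset)


print(decode_wal_name("000000010000036A000000C6"))
print(decode_wal_name("0000000100000369000000AF"))
```

The second call gives 369/AF000000, consistent with "redo starts at 369/AFD28900" falling inside that segment.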

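Thanks!

The problem seems to be solved.

Apparently I had a hardware problem on the master server.
I was able to run a full pg_dump and reindex my database, so I'm pretty sure I don't have any data integrity issues.

But when I looked at the master server logs after enabling log_checkpoints in the configuration, I saw the following message a few minutes before the slave stopped performing checkpoints: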

IDT ERROR:  failed to re-find parent key in index "<table_name>_id_udx" for split pages 17/18

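After seeing that, I decided to change hosting providers and move my database to a new server. Since then (almost a week now), replication has been running smoothly and checkpoints are running as expected.

I really hope this helps other people. When something like this happens, keep in mind that the problem may be caused by data integrity or hardware issues.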