Distributed Postgres deployment out of sync

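I set up a distributed Postgres database system and configured replication with wal_level = hot_standby.

There is one central master database with multiple replicas (currently 15 around the world) that are used as read-only instances, so no failover is needed. We just want the data synced to our remote locations so it can be read there.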

Master:

wal_level = hot_standby
max_wal_senders = 20
checkpoint_segments = 8    
wal_keep_segments = 8

Client:

wal_level = hot_standby
max_wal_senders = 3
checkpoint_segments = 8    
wal_keep_segments = 8 
hot_standby = on

/var/lib/postgresql/9.4/recovery.conf on the client:

standby_mode = 'on'
primary_conninfo = 'host=<IP of master> port=5432 user=replicator password=xxxx sslmode=require'
trigger_file = '/tmp/postgresql.trigger'

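Replication started and everything seemed fine for a few days. After that, the master seemed to stop accepting any further connections for replication...

Client: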

2017-05-04 01:16:51 UTC [9608-1] FATAL:  could not connect to the primary server: FATAL:  sorry, too many clients already
2017-05-04 01:16:57 UTC [10807-1] FATAL:  could not connect to the primary server: FATAL:  sorry, too many clients already
2017-05-04 01:17:02 UTC [12022-1] FATAL:  could not connect to the primary server: FATAL:  sorry, too many clients already
2017-05-04 01:17:06 UTC [13217-1] FATAL:  could not connect to the primary server: FATAL:  remaining connection slots are reserved for non-replication superuser connections
...

Master:

Then the log fills up with messages like the ones below, and it never recovers...

2017-05-04 08:44:14 UTC [24850-1] replicator@[unknown] ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:19 UTC [25958-1] replicator@[unknown] ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:24 UTC [27063-1] replicator@[unknown] ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:29 UTC [28144-1] replicator@[unknown] ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:34 UTC [29227-1] replicator@[unknown] ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:39 UTC [30316-1] replicator@[unknown] ERROR:  requested WAL segment 000000010000003500000014 has already been removed
...

Client:

2017-04-30 11:26:22 UTC [28474-1] LOG:  started streaming WAL from primary at 35/14000000 on timeline 1
2017-04-30 11:26:22 UTC [28474-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-04-30 11:26:26 UTC [29328-1] LOG:  started streaming WAL from primary at 35/14000000 on timeline 1
2017-04-30 11:26:26 UTC [29328-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-04-30 11:26:31 UTC [30394-1] LOG:  started streaming WAL from primary at 35/14000000 on timeline 1
2017-04-30 11:26:31 UTC [30394-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000003500000014 has already been removed
...

So my question is:

https://www.postgresql.org/docs/current/static/runtime-config-replication.html:

max_wal_senders (integer)

Specifies the maximum number of concurrent connections from standby servers or streaming base backup clients (i.e., the maximum number of simultaneously running WAL sender processes). The default is zero, meaning replication is disabled. WAL sender processes count towards the total number of connections, so the parameter cannot be set higher than max_connections. Abrupt streaming client disconnection might cause an orphaned connection slot until a timeout is reached, so this parameter should be set slightly higher than the maximum number of expected clients so disconnected clients can immediately reconnect.

(Emphasis mine.) Either application connections or orphaned connections are causing your

FATAL: sorry, too many clients already

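You might want to put a connection pooler such as pgbouncer in front of the applications, so that their connections are capped before they exhaust the slots on the master.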
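As a rough illustration of the pooling idea, a minimal pgbouncer.ini might look like the sketch below. The database name appdb, the addresses, the file path, and every limit are assumptions chosen for illustration, not values from the question. In this sketch only the applications go through the pooler on port 6432, while the standbys keep connecting directly to port 5432.

```ini
; Hypothetical pgbouncer.ini sketch; names, paths and numbers are assumptions.
[databases]
; Applications connect to "appdb" on pgbouncer, which forwards to the master.
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; Return a server connection to the pool after each transaction.
pool_mode = transaction
; Many application clients may connect to pgbouncer...
max_client_conn = 500
; ...but only this many server connections are opened per database/user pair.
default_pool_size = 20
```

The point of the design is that default_pool_size, not the number of application clients, bounds how many slots the applications can take on the master, which leaves room under max_connections for the WAL senders.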

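To answer your question: if you have archive_command set to actually copy WAL somewhere, modify restore_command in recovery.conf on the slave to pick those files up. That will let the slave catch up from the moment it lost the stream. Otherwise you have to rebuild it.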
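A minimal sketch of that archive-based setup follows, modelled on the examples in the PostgreSQL documentation. The directory /mnt/server/archivedir is an assumption; it stands for any location that the master can write to and the slave can read from (for example a shared mount).

```ini
# --- postgresql.conf on the master (changing archive_mode needs a restart) ---
archive_mode = on
# Copy each completed WAL segment to the archive, refusing to overwrite.
# /mnt/server/archivedir is a hypothetical path.
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'

# --- recovery.conf on the slave (in addition to the existing settings) ---
# Fetch segments from the archive when they are no longer available by streaming.
restore_command = 'cp /mnt/server/archivedir/%f %p'
```

With this in place, a slave that falls behind further than wal_keep_segments allows can replay the missing segments from the archive and then resume streaming. A slave that is already stuck on a removed segment can only catch up this way if that segment was archived before it was removed.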