PostgreSQL master server hangs on replication flow

First of all, I'm not a data engineer, so I'll do my best to give you everything you need to solve my problem :/

Context: I'm trying to set up two PostgreSQL servers, one master and one slave.

psql (PostgreSQL) 10.9 (Ubuntu 10.9-0ubuntu0.18.04.1)

As I understand it, synchronous replication is not a good idea when you only have two servers. But I need to understand what is happening here...

Problem: When I run CREATE SCHEMA test; the master hangs. However, the schema exists on the master and also on the slave. The master hangs because it is waiting for the slave to confirm the commit...

Master configuration: /etc/postgresql/10/main/conf.d/master.conf

# Connection
listen_addresses = '127.0.0.1,slave-ip'
ssl = on
ssl_cert_file = '/etc/ssl/postgresql/certs/server.pem'
ssl_key_file = '/etc/ssl/postgresql/private/server.key'
ssl_ca_file = '/etc/ssl/postgresql/certs/server.pem'
password_encryption = scram-sha-256
# WAL
wal_level = replica
synchronous_commit = remote_apply #local works, remote_apply hangs
# Archive
archive_mode = on
archive_command = 'rsync -av %p postgres@lab-3:/var/lib/postgresql/wal_archive_lab_2/%f'
# Replication master
max_wal_senders = 2
wal_keep_segments = 100
synchronous_standby_names = 'ANY 1 ("lab-3")'

/etc/postgresql/10/main/pg_hba.conf

hostssl replication     replicate       slave-ip/32         scram-sha-256

Slave configuration: /etc/postgresql/10/main/conf.d/standby.conf

# Connection
listen_addresses = '127.0.0.1,master-ip'
ssl = on
ssl_cert_file = '/etc/ssl/postgresql/certs/server.pem'
ssl_key_file = '/etc/ssl/postgresql/private/server.key'
ssl_ca_file = '/etc/ssl/postgresql/certs/server.pem'
password_encryption = scram-sha-256
# WAL
wal_level = replica
# Archive
archive_mode = on
archive_command = 'rsync -av %p postgres@lab-3:/var/lib/postgresql/wal_archive_lab_3/%f'
# Replication slave
max_wal_senders = 2
wal_keep_segments = 100
hot_standby = on

/var/lib/postgresql/10/main/recovery.conf

standby_mode = on
primary_conninfo = 'host=master-ip port=5432 user=replicate password=replicate_password sslmode=require application_name="lab-3"'
trigger_file = '/var/lib/postgresql/10/postgresql.trigger'

When it hangs, I get nothing at all in the log file, only this error when I abort the command on the master with Ctrl+C:

WARNING: canceling wait for synchronous replication due to user request

DETAIL: The transaction has already committed locally, but might not have been replicated to the standby.

Is there a way to check what is happening, and why it gets stuck like this?
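One way to see what the hanging session is waiting on, sketched under the assumption that you can open a second psql session on the master while CREATE SCHEMA is stuck: in PostgreSQL 10, a backend that has committed locally and is waiting for a synchronous standby reports the wait event SyncRep in pg_stat_activity.

```sql
-- Run on the master, from a second session, while CREATE SCHEMA is hanging.
-- Lists backends that are blocked waiting for synchronous replication.
SELECT pid,
       usename,
       application_name,
       state,
       wait_event_type,
       wait_event,
       query
FROM pg_stat_activity
WHERE wait_event = 'SyncRep';
```

If your CREATE SCHEMA shows up here, the commit is done locally and the master is only waiting for a standby that satisfies synchronous_standby_names, which is consistent with the WARNING shown above.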

Edit 1

Contents of pg_stat_replication:

Before the query

  pid  | usesysid |  usename  | application_name | client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+-----------+------------------+--------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------
 54431 |    16384 | replicate | "lab-3"          | slave-ip     |                 |       47742 | 2019-08-06 07:56:48.105056+02 |              | streaming | 0/110000D0 | 0/110000D0 | 0/110000D0 | 0/110000D0 |           |           |            |             0 | async

(1 row)

While/after it hangs

  pid  | usesysid |  usename  | application_name | client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn |    write_lag    |    flush_lag    |  replay_lag   | sync_priority | sync_state
-------+----------+-----------+------------------+--------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------------+-----------------+---------------+---------------+------------
 54431 |    16384 | replicate | "lab-3"          | slave-ip     |                 |       47742 | 2019-08-06 07:56:48.105056+02 |              | streaming | 0/11000C10 | 0/11000C10 | 0/11000C10 | 0/11000C10 | 00:00:00.000521 | 00:00:00.004421 | 00:00:00.0045 |             0 | async

(1 row)
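Both rows already hint at the cause: sync_priority is 0 and sync_state is async, meaning the master does not treat this standby as a synchronous candidate at all. A narrower query, run on the master, puts the name the standby reports next to the setting it has to match. The length column is only an illustrative trick to make stray quote characters hard to miss ("lab-3" with quotes is 7 characters, lab-3 without them is 5).

```sql
-- Run on the master: compare the standby's reported name with the configured list.
SELECT application_name,
       length(application_name) AS name_length,  -- 7 if the quotes are part of the name
       sync_priority,                             -- 0 = not listed in synchronous_standby_names
       sync_state,
       current_setting('synchronous_standby_names') AS expected_names
FROM pg_stat_replication;
```

Once the names match and the standby has reconnected, sync_state should no longer be async; with the ANY method it is expected to show quorum.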

Thanks!

As Laurenz Albe pointed out, the problem was the quoting of the synchronous standby name.

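The documentation explains that if the name contains a dash, it must be double-quoted in the synchronous_standby_names entry on the master, but it must not be quoted in the primary_conninfo value on the slave. Otherwise the double quotes become part of the standby's application_name, which is why pg_stat_replication above shows "lab-3" (quotes included) with sync_priority 0 and sync_state async: the master never found a standby matching lab-3 and kept waiting. Keeping 'ANY 1 ("lab-3")' on the master and using application_name=lab-3 (no quotes) in primary_conninfo on the slave makes the names match.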