pcp_recovery_node command hangs while recovering standby

Question

This is part of a cluster I am building. When I run pcp_recovery_node on the master to build the standby from scratch, using the command

pcp_recovery_node -h 193.185.83.119 -p 9898 -U postgres -n 1

Here 193.185.83.119 is the VIP. It successfully builds and starts the standby on node-b (assume the nodes are node-a and node-b), but the command above never returns and just hangs in the shell, like this:

[postgres@rollc-filesrvr1 data]$ pcp_recovery_node -h 193.185.83.119 -p 9898 -U postgres -n 1
Password:

I have to press Ctrl+C to get out of this session. Later, when I try to create a test database on node-a (master), I get the following error:

      postgres=# create database test;
        ERROR:  source database "template1" is being accessed by other users
        DETAIL:  There is 1 other session using the database.
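
A hedged way to see which session is holding template1 open is to query pg_stat_activity on the master. The sketch below only builds the query text. The helper name sessions_query is hypothetical, and the port 5438 and user postgres in the usage comment are taken from the logs and commands in this post.

```shell
# Hypothetical helper: print a query listing the sessions connected to one database.
# pg_stat_activity is a standard PostgreSQL view; the database name is the only input.
sessions_query() {
  printf "SELECT pid, usename, application_name, client_addr, state FROM pg_stat_activity WHERE datname = '%s';" "$1"
}

# Usage on the master (port and user as used elsewhere in this post):
#   psql -p 5438 -U postgres -d postgres -c "$(sessions_query template1)"
```

If the listed session turns out to belong to pgpool, that would be consistent with the log further down, where the pcp worker keeps trying to connect to template1 on node-a.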

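I confirmed that pgpool.service is running on node-a when this command is executed. I have tried with pgpool.service both on and off on node-b (standby) before issuing the pcp command. The result is the same.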

I also searched online and adjusted some settings in pgpool.conf. I am not sure whether these parameters are even relevant:

wd_lifecheck_dbname in pgpool.conf

Initially the related settings were as follows (and I still got the same result):

wd_lifecheck_dbname = 'template1'
wd_lifecheck_user = 'nobody'
wd_lifecheck_password = ''

Later I found different suggested settings in a few other posts and tried the following combinations:

wd_lifecheck_dbname = 'template1'
wd_lifecheck_user = 'postgres'
wd_lifecheck_password = ''

or

wd_lifecheck_dbname = 'postgres'
wd_lifecheck_user = 'postgres'
wd_lifecheck_password = ''

But none of them changed the behaviour in the shell, and none of them let me create the test database on the master. I feel I have hit a dead end.

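I still do not fully understand the purpose and meaning of the three parameters above in pgpool, and I suspect they are what I have misconfigured, although there could be other causes.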
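
For context, according to the Pgpool-II documentation, wd_lifecheck_dbname, wd_lifecheck_user and wd_lifecheck_password apply only when wd_lifecheck_method is 'query'. In that mode the watchdog connects to the other Pgpool-II node with those credentials and runs wd_lifecheck_query. A minimal sketch for checking which of these settings are active; the helper name show_lifecheck_settings and the config path in the usage comment are assumptions, not taken from this post.

```shell
# Hypothetical helper: print the uncommented watchdog lifecheck settings
# from a pgpool.conf file given as the first argument.
show_lifecheck_settings() {
  grep -E '^[[:space:]]*wd_lifecheck_(method|query|dbname|user|password)[[:space:]]*=' "$1"
}

# Usage (the path is an assumption; adjust to where pgpool.conf actually lives):
#   show_lifecheck_settings /etc/pgpool-II/pgpool.conf
```

If wd_lifecheck_method turns out to be 'heartbeat', changing the three parameters above would not be expected to affect the hang.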

For reference, here are the environment details again.

node-a and node-b environment: RHEL 7.6
postgres version: 10.7
pgpool-II version: 4.0.3
replication slots + WAL archiving

Here is the pgpool.service log from node-a:

Mar 18 21:10:17 node-a pgpool[16583]: 2019-03-18 21:10:17: pid 16642: LOG:  forked new pcp worker, pid=8534 socket=7
Mar 18 21:10:17 node-a pgpool[16583]: 2019-03-18 21:10:17: pid 8534: LOG:  starting recovering node 1
Mar 18 21:10:17 node-a pgpool[16583]: 2019-03-18 21:10:17: pid 8534: LOG:  executing recovery
Mar 18 21:10:17 node-a pgpool[16583]: 2019-03-18 21:10:17: pid 8534: DETAIL:  starting recovery command: "SELECT pgpool_recovery('recovery_1st_stage', 'node-a-ip', '/data/test/data', '5438', 1)"
Mar 18 21:10:17 node-a pgpool[16583]: 2019-03-18 21:10:17: pid 8534: LOG:  executing recovery
Mar 18 21:10:17 node-a pgpool[16583]: 2019-03-18 21:10:17: pid 8534: DETAIL:  disabling statement_timeout
Mar 18 21:10:18 node-a pgpool[16583]: 2019-03-18 21:10:18: pid 8534: LOG:  node recovery, 1st stage is done
Mar 18 21:11:37 node-a pgpool[16583]: 2019-03-18 21:11:37: pid 8534: LOG:  checking if postmaster is started
Mar 18 21:11:37 node-a pgpool[16583]: 2019-03-18 21:11:37: pid 8534: DETAIL:  trying to connect to postmaster on hostname:node-b-ip database:postgres user:postgres (retry 0 times)
...
...2 more times 
Mar 18 21:11:49 node-a pgpool[16583]: 2019-03-18 21:11:49: pid 8534: LOG:  checking if postmaster is started
Mar 18 21:11:49 node-a pgpool[16583]: 2019-03-18 21:11:49: pid 8534: DETAIL:  trying to connect to postmaster on hostname:node-a-ip database:template1 user:postgres (retry 0 times)
...it keeps on trying until I press Ctrl+C in the pcp command window. I have seen the retry count go up to 30 or more.

Also, node-b never shows as up when checked through pgpool.

postgres=> show pool_nodes;
 node_id | hostname  | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change
---------+-----------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | node-a-ip | 5438 | up     | 0.500000  | primary | 0          | true              | 0                 | 2019-03-18 22:59:19
 1       | node-b-ip | 5438 | down   | 0.500000  | standby | 0          | false             | 0                 | 2019-03-18 22:59:19
(2 rows)

EDIT: I have at least been able to fix the last part of this question, i.e. adding the standby node to the cluster:

[postgres@node-a-hostname]$ pcp_attach_node -n 1
Password:
pcp_attach_node -- Command Successful

Now the last part at least shows the correct state:

postgres=> show pool_nodes;
 node_id | hostname  | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change
---------+-----------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | node-a-ip | 5438 | up     | 0.500000  | primary | 0          | false             | 0                 | 2019-03-18 22:59:19
 1       | node-b-ip | 5438 | up     | 0.500000  | standby | 0          | true              | 0                 | 2019-03-19 11:38:38
(2 rows)

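But the underlying problem of not being able to create a database on node1 (the master, node-a) remains.

EDIT2: I tried inserts and updates on the master, and they are replicated correctly to node2 (the standby, node-b), but create database still does not work.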

Answer 1

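First, a correction to EDIT1: pcp_attach_node did fix the output of show pool_nodes, but it made things worse, because other commands such as

pcp_watchdog_info -h 193.185.83.119 -p 9898 -U postgres

started to hang. Later I found that

pcp_attach_node -n 1

is not needed at all to attach the standby or to correct the output of show pool_nodes, IF the original pcp_recovery_node completes properly on the master.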

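The root cause of the original problem, and of the later watchdog hang, was that the pgpool_remote_start script did not finish properly even after it had started the standby. I could see it still running in the output of

ps -ef | grep pgpool

on the master.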

I contacted the pgpool bug tracking system, and they helped me fix it. The postgres start command in pgpool_remote_start was incorrect, which caused the problem, so pcp_recovery_node never completed.
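
One plausible mechanism for this kind of hang (an assumption on my part, not something confirmed by the bug tracker exchange): a process started in the background keeps the caller's output stream open, so whoever is reading that output never sees end-of-file. The local sketch below reproduces the effect without ssh or postgres. The function and variable names are hypothetical, and sleep 3 stands in for a long-running server process.

```shell
# Run a command line via sh, capture its output, and print how many whole
# seconds the capture took. The capture only ends when every process that
# holds the output pipe has closed it.
elapsed_capturing() {
  start=$(date +%s)
  captured=$(sh -c "$1")
  end=$(date +%s)
  echo $((end - start))
}

# Background child inherits stdout: the capture waits for the child to exit.
inherits='sleep 3 & echo started'
# Background child is detached from stdin, stdout and stderr: the capture
# does not have to wait for it.
detached='sleep 3 </dev/null >/dev/null 2>&1 & echo started'

echo "inherits: $(elapsed_capturing "$inherits")s"
echo "detached: $(elapsed_capturing "$detached")s"
```

The first variant should take about as long as the background process lives, while the second should return almost immediately. A server started over ssh without any redirection would behave like the first variant for as long as the server runs.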

The correct command in pgpool_remote_start should look like this (this is what I used):

ssh -T postgres@$REMOTE_HOST /usr/pgsql-10/bin/pg_ctl -w start -D /data/test/data 2>/dev/null 1>/dev/null </dev/null &

I had been using

ssh -T postgres@$REMOTE_HOST /usr/pgsql-10/bin/pg_ctl start -D /data/test/data

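I was missing the -w flag. I was also not redirecting stdout and stderr to /dev/null, and stdin was not redirected from /dev/null, so the command never received an EOF on its input.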
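
Put together, a pgpool_remote_start script built around the corrected command might look like the sketch below. It assumes the documented calling convention, in which Pgpool-II passes the remote host and that host's data directory as the two arguments. The function name remote_start and the SSH_BIN override are hypothetical (the override exists only so the function can be exercised without a real ssh), and the pg_ctl path is the one used in this post.

```shell
#!/bin/sh
# Sketch of a pgpool_remote_start script. Pgpool-II calls it with two arguments:
# the host to start and that host's database cluster directory.
PGCTL=/usr/pgsql-10/bin/pg_ctl

remote_start() {
  remote_host=$1
  remote_pgdata=$2
  # -w makes pg_ctl wait for startup to finish. All three standard streams are
  # detached and the ssh client is backgrounded, so the caller is not left
  # waiting on the remote session.
  "${SSH_BIN:-ssh}" -T "postgres@$remote_host" "$PGCTL" -w start -D "$remote_pgdata" \
    2>/dev/null 1>/dev/null </dev/null &
}

# Run only when invoked with the two arguments Pgpool-II passes.
if [ "$#" -eq 2 ]; then
  remote_start "$1" "$2"
fi
```

Taking the data directory from the second argument instead of hard-coding /data/test/data is a design choice of this sketch, not something the original script is known to do.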

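One thing I am still not sure about, but which may help people with a similar problem: start pgpool.service on the standby first, or keep it running, before issuing the pcp command on the master.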

postgresql

high-availability

database-replication

pgpool

postgresql-10