复制槽的 PostgreSQL BDR 问题
PostgreSQL BDR issues with replication slots
如果可以的话,我想在添加更多细节以使我的问题更清晰、更清晰之前写一个问题的简短摘要:
我在生产环境中有一个 BDR 组 运行ning,复制在所有节点中工作正常,但我在其中一个节点中收到关于复制槽的错误,我想在不破坏任何东西的情况下修复否则...
更多详情:
我在 PostgreSQL 9.4 上有一个三节点 BDR 组 运行ning,其中 2 个节点使用 RedHat7,另一个节点使用 Centos7。数据库名称是“MCT”。
node_sysid |node_timeline|node_dboid|node_status|node_name |node_local_dsn |node_init_from_dsn |node_read_only|
-------------------|-------------|----------|-----------|----------|---------------|-------------------|--------------|
6363474972929529353| 4| 263746|r |REDHAT_1 |172.16.5.180 | |false |
6182181726884158214| 2| 623638|r |REDHAT_2 |192.168.251.30 | 172.16.5.180 |false |
6197060214704992943| 2| 224093|r |CENTOS_1 |172.16.5.184 | 172.16.5.180 |false |
从任何节点到其余节点的复制工作正常,但是...
所有节点都在其日志中显示复制槽问题:
REDHAT_1:
2020-11-25 13:26:50.051 CET,"postgres","MCT",6607,"192.168.251.30:60463",5fbe4d8a.19cf,1,"idle",2020-11-25 13:26:50 CET,23/0,0,ERROR,55006,"replication slot ""bdr_263746_6182181726884158214_2_16389__"" is already active for pid 6097",,,,,,,,,"bdr (6182181726884158214,2,16389,):receive"
REDHAT_2:
2020-11-25 13:26:36.153 CET,,,2142,,5fbe48a5.85e,1015,,2020-11-25 13:05:57 CET,,0,LOG,00000,"proceso «background worker»: bdr (6182181726884158214,2,16389,)->bdr (6197060214704992943,2, (PID 12152) terminó con código de salida 1",,,,,,,,,""
2020-11-25 13:26:39.375 CET,,,2142,,5fbe48a5.85e,1016,,2020-11-25 13:05:57 CET,,0,LOG,00000,"iniciando el proceso «background worker» «bdr (6182181726884158214,2,16389,)->bdr (6363474972929529353,4,»",,,,,,,,,""
2020-11-25 13:26:39.389 CET,,,12154,,5fbe4d7f.2f7a,1,,2020-11-25 13:26:39 CET,14/0,0,FATAL,XX000,"could not send replication command ""START_REPLICATION SLOT ""bdr_263746_6182181726884158214_2_16389__"" LOGICAL CE4/BD6AF6A8 (pg_version '90410', pg_catversion '201409291', bdr_version '10002', bdr_variant 'BDR', min_bdr_version '700', sizeof_int '4', sizeof_long '8', sizeof_datum '8', maxalign '8', float4_byval '1', float8_byval '1', integer_datetimes '1', bigendian '0', db_encoding 'UTF8', replication_sets '""default""', db_encoding 'UTF8')"": ERROR: replication slot ""bdr_263746_6182181726884158214_2_16389__"" is already active for pid 6097, sqlstate: 55006",,,,,,,,,"bdr (6182181726884158214,2,16389,): apply"
2020-11-25 13:26:39.390 CET,,,2142,,5fbe48a5.85e,1017,,2020-11-25 13:05:57 CET,,0,LOG,00000,"proceso «background worker»: bdr (6182181726884158214,2,16389,)->bdr (6363474972929529353,4, (PID 12154) terminó con código de salida 1",,,,,,,,,""
2020-11-25 13:26:41.155 CET,,,2142,,5fbe48a5.85e,1018,,2020-11-25 13:05:57 CET,,0,LOG,00000,"iniciando el proceso «background worker» «bdr (6182181726884158214,2,16389,)->bdr (6197060214704992943,2,»",,,,,,,,,""
2020-11-25 13:26:41.165 CET,,,12155,,5fbe4d81.2f7b,1,,2020-11-25 13:26:41 CET,14/0,0,FATAL,XX000,"could not send replication command ""START_REPLICATION SLOT ""bdr_224093_6182181726884158214_2_16389__"" LOGICAL 9B4/F020E568 (pg_version '90410', pg_catversion '201409291', bdr_version '10002', bdr_variant 'BDR', min_bdr_version '700', sizeof_int '4', sizeof_long '8', sizeof_datum '8', maxalign '8', float4_byval '1', float8_byval '1', integer_datetimes '1', bigendian '0', db_encoding 'UTF8', replication_sets '""default""', db_encoding 'UTF8')"": ERROR: replication slot ""bdr_224093_6182181726884158214_2_16389__"" is already active for pid 2192
CENTOS_1:
2020-11-25 13:26:56.208 CET,"postgres","MCT",2668,"192.168.251.30:55270",5fbe4d90.a6c,1,"idle",2020-11-25 13:26:56 CET,10/0,0,ERROR,55006,"replication slot ""bdr_224093_6182181726884158214_2_16389__"" is already active for pid 2192",,,,,,,,,"bdr (6182181726884158214,2,16389,):receive"
如果我运行这个查询:
select
pg_catalog.inet_server_addr() as server_addr,
pg_xlog_location_diff(pg_current_xlog_insert_location(),
flush_location)::int8 as lag_bytes,
pid,
application_name,
client_addr,
state
from
pg_stat_replication;
我得到这个结果:
REDHAT_1:
server_addr |lag_bytes|pid |application_name |client_addr |state |
------------|---------|----|-------------------------------------------|--------------|---------|
172.16.5.180| 21088|6131|bdr (6197060214704992943,2,224093,):receive|172.16.5.184 |streaming|
172.16.5.180| 14440|6097|bdr (6182181726884158214,2,16389,):receive |192.168.251.30|streaming|
172.16.5.180| 3616|6099|bdr (6182181726884158214,2,623638,):receive|192.168.251.30|streaming|
REDHAT_2:
server_addr |lag_bytes|pid |application_name |client_addr |state |
------------|---------|----|-------------------------------------------|--------------|---------|
172.16.5.184| 0|2191|bdr (6182181726884158214,2,623638,):receive|192.168.251.30|streaming|
172.16.5.184| 0|2192|bdr (6182181726884158214,2,16389,):receive |192.168.251.30|streaming|
172.16.5.184| 0|2203|bdr (6363474972929529353,4,263746,):receive|172.16.5.180 |streaming|
CENTOS_1:
server_addr |lag_bytes|pid |application_name |client_addr |state |
--------------|---------|----|-------------------------------------------|--------------|---------|
192.168.251.30| 0|4298|bdr (6182181726884158214,2,16389,):receive |192.168.251.30|streaming|
192.168.251.30| 0|5123|bdr (6197060214704992943,2,224093,):receive|172.16.5.184 |streaming|
192.168.251.30| 0|4313|bdr (6363474972929529353,4,263746,):receive|172.16.5.180 |streaming|
这个查询:
select
slot_name,
plugin,
slot_type,
datoid,
"database",
active,
xmin,
catalog_xmin,
restart_lsn
from
pg_catalog.pg_replication_slots;
REDHAT_1:
slot_name |plugin|slot_type|datoid|database|active|xmin|catalog_xmin|restart_lsn |
-----------------------------------------|------|---------|------|--------|------|----|------------|------------|
bdr_263746_6182181726884158214_2_16389__ |bdr |logical |263746|MCT |true | |2466301995 |CE4/CDA614E8|
bdr_263746_6182181726884158214_2_623638__|bdr |logical |263746|MCT |true | |2466301995 |CE4/CDA614E8|
bdr_263746_6197060214704992943_2_224093__|bdr |logical |263746|MCT |true | |2466301995 |CE4/CDA614E8|
REDHAT_2:
slot_name |plugin|slot_type|datoid|database|active|xmin|catalog_xmin|restart_lsn |
-----------------------------------------|------|---------|------|--------|------|----|------------|------------|
bdr_224093_6182181726884158214_2_16389__ |bdr |logical |224093|MCT |true | |1464775472 |9B4/FED0BCA8|
bdr_224093_6182181726884158214_2_623638__|bdr |logical |224093|MCT |true | |1464775472 |9B4/FED0BCA8|
bdr_224093_6363474972929529353_4_263746__|bdr |logical |224093|MCT |true | |1464775472 |9B4/FED0BCA8|
CENTOS_1:
slot_name |plugin|slot_type|datoid|database|active|xmin|catalog_xmin|restart_lsn |
-----------------------------------------|------|---------|------|--------|------|----|------------|------------|
bdr_623638_6363474972929529353_4_263746__|bdr |logical |623638|MCT |true | |546376380 |85C/647FE110|
bdr_623638_6182181726884158214_2_16389__ |bdr |logical |623638|MCT |true | |546376380 |85C/647FE110|
bdr_623638_6197060214704992943_2_224093__|bdr |logical |623638|MCT |true | |546376380 |85C/647FE110|
所以,我想知道如何在不破坏当前正在生产的 BDR 中的任何内容的情况下消除该错误。
有什么想法吗?
提前致谢!
根本问题是 REDHAT_2 节点中名为“OLD”的第二个数据库试图与所有节点中的“MCT”同步并取得复制槽的所有权,因此真正的“MCT”节点无法真正打开连接。
在“旧”数据库中关闭 BDR 并重新启动 PSQL 解决了这个问题。谢谢
如果可以的话,我想在添加更多细节以使我的问题更清晰、更清晰之前写一个问题的简短摘要:
我在生产环境中有一个 BDR 组 运行ning,复制在所有节点中工作正常,但我在其中一个节点中收到关于复制槽的错误,我想在不破坏任何东西的情况下修复否则...
更多详情:
我在 PostgreSQL 9.4 上有一个三节点 BDR 组 运行ning,其中 2 个节点使用 RedHat7,另一个节点使用 Centos7。数据库名称是“MCT”。
node_sysid |node_timeline|node_dboid|node_status|node_name |node_local_dsn |node_init_from_dsn |node_read_only|
-------------------|-------------|----------|-----------|----------|---------------|-------------------|--------------|
6363474972929529353| 4| 263746|r |REDHAT_1 |172.16.5.180 | |false |
6182181726884158214| 2| 623638|r |REDHAT_2 |192.168.251.30 | 172.16.5.180 |false |
6197060214704992943| 2| 224093|r |CENTOS_1 |172.16.5.184 | 172.16.5.180 |false |
从任何节点到其余节点的复制工作正常,但是...
所有节点都在其日志中显示复制槽问题:
REDHAT_1:
2020-11-25 13:26:50.051 CET,"postgres","MCT",6607,"192.168.251.30:60463",5fbe4d8a.19cf,1,"idle",2020-11-25 13:26:50 CET,23/0,0,ERROR,55006,"replication slot ""bdr_263746_6182181726884158214_2_16389__"" is already active for pid 6097",,,,,,,,,"bdr (6182181726884158214,2,16389,):receive"
REDHAT_2:
2020-11-25 13:26:36.153 CET,,,2142,,5fbe48a5.85e,1015,,2020-11-25 13:05:57 CET,,0,LOG,00000,"proceso «background worker»: bdr (6182181726884158214,2,16389,)->bdr (6197060214704992943,2, (PID 12152) terminó con código de salida 1",,,,,,,,,""
2020-11-25 13:26:39.375 CET,,,2142,,5fbe48a5.85e,1016,,2020-11-25 13:05:57 CET,,0,LOG,00000,"iniciando el proceso «background worker» «bdr (6182181726884158214,2,16389,)->bdr (6363474972929529353,4,»",,,,,,,,,""
2020-11-25 13:26:39.389 CET,,,12154,,5fbe4d7f.2f7a,1,,2020-11-25 13:26:39 CET,14/0,0,FATAL,XX000,"could not send replication command ""START_REPLICATION SLOT ""bdr_263746_6182181726884158214_2_16389__"" LOGICAL CE4/BD6AF6A8 (pg_version '90410', pg_catversion '201409291', bdr_version '10002', bdr_variant 'BDR', min_bdr_version '700', sizeof_int '4', sizeof_long '8', sizeof_datum '8', maxalign '8', float4_byval '1', float8_byval '1', integer_datetimes '1', bigendian '0', db_encoding 'UTF8', replication_sets '""default""', db_encoding 'UTF8')"": ERROR: replication slot ""bdr_263746_6182181726884158214_2_16389__"" is already active for pid 6097, sqlstate: 55006",,,,,,,,,"bdr (6182181726884158214,2,16389,): apply"
2020-11-25 13:26:39.390 CET,,,2142,,5fbe48a5.85e,1017,,2020-11-25 13:05:57 CET,,0,LOG,00000,"proceso «background worker»: bdr (6182181726884158214,2,16389,)->bdr (6363474972929529353,4, (PID 12154) terminó con código de salida 1",,,,,,,,,""
2020-11-25 13:26:41.155 CET,,,2142,,5fbe48a5.85e,1018,,2020-11-25 13:05:57 CET,,0,LOG,00000,"iniciando el proceso «background worker» «bdr (6182181726884158214,2,16389,)->bdr (6197060214704992943,2,»",,,,,,,,,""
2020-11-25 13:26:41.165 CET,,,12155,,5fbe4d81.2f7b,1,,2020-11-25 13:26:41 CET,14/0,0,FATAL,XX000,"could not send replication command ""START_REPLICATION SLOT ""bdr_224093_6182181726884158214_2_16389__"" LOGICAL 9B4/F020E568 (pg_version '90410', pg_catversion '201409291', bdr_version '10002', bdr_variant 'BDR', min_bdr_version '700', sizeof_int '4', sizeof_long '8', sizeof_datum '8', maxalign '8', float4_byval '1', float8_byval '1', integer_datetimes '1', bigendian '0', db_encoding 'UTF8', replication_sets '""default""', db_encoding 'UTF8')"": ERROR: replication slot ""bdr_224093_6182181726884158214_2_16389__"" is already active for pid 2192
CENTOS_1:
2020-11-25 13:26:56.208 CET,"postgres","MCT",2668,"192.168.251.30:55270",5fbe4d90.a6c,1,"idle",2020-11-25 13:26:56 CET,10/0,0,ERROR,55006,"replication slot ""bdr_224093_6182181726884158214_2_16389__"" is already active for pid 2192",,,,,,,,,"bdr (6182181726884158214,2,16389,):receive"
如果我运行这个查询:
select
pg_catalog.inet_server_addr() as server_addr,
pg_xlog_location_diff(pg_current_xlog_insert_location(),
flush_location)::int8 as lag_bytes,
pid,
application_name,
client_addr,
state
from
pg_stat_replication;
我得到这个结果:
REDHAT_1:
server_addr |lag_bytes|pid |application_name |client_addr |state |
------------|---------|----|-------------------------------------------|--------------|---------|
172.16.5.180| 21088|6131|bdr (6197060214704992943,2,224093,):receive|172.16.5.184 |streaming|
172.16.5.180| 14440|6097|bdr (6182181726884158214,2,16389,):receive |192.168.251.30|streaming|
172.16.5.180| 3616|6099|bdr (6182181726884158214,2,623638,):receive|192.168.251.30|streaming|
REDHAT_2:
server_addr |lag_bytes|pid |application_name |client_addr |state |
------------|---------|----|-------------------------------------------|--------------|---------|
172.16.5.184| 0|2191|bdr (6182181726884158214,2,623638,):receive|192.168.251.30|streaming|
172.16.5.184| 0|2192|bdr (6182181726884158214,2,16389,):receive |192.168.251.30|streaming|
172.16.5.184| 0|2203|bdr (6363474972929529353,4,263746,):receive|172.16.5.180 |streaming|
CENTOS_1:
server_addr |lag_bytes|pid |application_name |client_addr |state |
--------------|---------|----|-------------------------------------------|--------------|---------|
192.168.251.30| 0|4298|bdr (6182181726884158214,2,16389,):receive |192.168.251.30|streaming|
192.168.251.30| 0|5123|bdr (6197060214704992943,2,224093,):receive|172.16.5.184 |streaming|
192.168.251.30| 0|4313|bdr (6363474972929529353,4,263746,):receive|172.16.5.180 |streaming|
这个查询:
select
slot_name,
plugin,
slot_type,
datoid,
"database",
active,
xmin,
catalog_xmin,
restart_lsn
from
pg_catalog.pg_replication_slots;
REDHAT_1:
slot_name |plugin|slot_type|datoid|database|active|xmin|catalog_xmin|restart_lsn |
-----------------------------------------|------|---------|------|--------|------|----|------------|------------|
bdr_263746_6182181726884158214_2_16389__ |bdr |logical |263746|MCT |true | |2466301995 |CE4/CDA614E8|
bdr_263746_6182181726884158214_2_623638__|bdr |logical |263746|MCT |true | |2466301995 |CE4/CDA614E8|
bdr_263746_6197060214704992943_2_224093__|bdr |logical |263746|MCT |true | |2466301995 |CE4/CDA614E8|
REDHAT_2:
slot_name |plugin|slot_type|datoid|database|active|xmin|catalog_xmin|restart_lsn |
-----------------------------------------|------|---------|------|--------|------|----|------------|------------|
bdr_224093_6182181726884158214_2_16389__ |bdr |logical |224093|MCT |true | |1464775472 |9B4/FED0BCA8|
bdr_224093_6182181726884158214_2_623638__|bdr |logical |224093|MCT |true | |1464775472 |9B4/FED0BCA8|
bdr_224093_6363474972929529353_4_263746__|bdr |logical |224093|MCT |true | |1464775472 |9B4/FED0BCA8|
CENTOS_1:
slot_name |plugin|slot_type|datoid|database|active|xmin|catalog_xmin|restart_lsn |
-----------------------------------------|------|---------|------|--------|------|----|------------|------------|
bdr_623638_6363474972929529353_4_263746__|bdr |logical |623638|MCT |true | |546376380 |85C/647FE110|
bdr_623638_6182181726884158214_2_16389__ |bdr |logical |623638|MCT |true | |546376380 |85C/647FE110|
bdr_623638_6197060214704992943_2_224093__|bdr |logical |623638|MCT |true | |546376380 |85C/647FE110|
所以,我想知道如何在不破坏当前正在生产的 BDR 中的任何内容的情况下消除该错误。
有什么想法吗?
提前致谢!
根本问题是 REDHAT_2 节点中名为“OLD”的第二个数据库试图与所有节点中的“MCT”同步并取得复制槽的所有权,因此真正的“MCT”节点无法真正打开连接。
在“旧”数据库中关闭 BDR 并重新启动 PSQL 解决了这个问题。谢谢