How to check the replication delay in PostgreSQL?

Question

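I want to measure the time between a row being inserted into the master table and the same row appearing in the slave table, using streaming replication in PostgreSQL 9.3. For this I created a table test_time with 2 fields, id (serial) and t (text). After that I added a trigger: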

cur_time:=to_char(current_timestamp, 'HH12:MI:SS:MS:US'); update test_time set t=cur_time where id=new.id;

But the time is the same in both tables. How can I measure the delay?

Answer 1

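You can get the lag in bytes from the master side quite easily by using pg_xlog_location_diff to compare the master's pg_current_xlog_insert_location with the replay_location of that backend's pg_stat_replication entry.

This only works when run on the master. You can't do it from the replica, because the replica has no idea how far ahead the master is.

Additionally, this won't tell you the lag in seconds. In current (as of at least 9.4) PostgreSQL versions there is no timestamp associated with a commit or a WAL record, so there is no way to tell how long ago a given LSN (xlog position) was.

The only way to get the replica lag in seconds on a current PostgreSQL version is to have an external process periodically commit an update to a dedicated timestamp table. You can then compare current_timestamp on the replica with the timestamp of the most recent entry in that table visible on the replica to see how far the replica is behind. This creates additional WAL traffic, which then has to be kept in your archived WAL for PITR (PgBarman or whatever), so you should balance the increased data use against the granularity of lag detection you require.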
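
To make the granularity trade-off concrete, here is a rough Python sketch of the arithmetic behind the timestamp-table approach. It assumes the external process really does write one heartbeat every `interval` and that the clocks on both hosts are synchronised; `lag_bounds` and its parameters are hypothetical names for illustration, not anything PostgreSQL provides. The newest heartbeat visible on the replica only brackets the lag: the replica has replayed at least up to that heartbeat, but not yet up to the next one.

```python
from datetime import datetime, timedelta


def lag_bounds(replica_now, last_heartbeat, interval):
    """Bracket the replica lag from the newest heartbeat visible on the replica.

    replica_now    -- current time on the replica (clocks assumed synchronised)
    last_heartbeat -- timestamp stored in the newest heartbeat row the replica sees
    interval       -- how often the external process writes a heartbeat (assumed)

    Returns (lower, upper) as timedeltas, clamped at zero.
    """
    # The replica has replayed at least up to last_heartbeat, so the lag
    # cannot exceed the age of that row.
    upper = max(replica_now - last_heartbeat, timedelta(0))
    # The next heartbeat has not been replayed yet, so the replica may already
    # be up to one interval past the row it shows.
    lower = max(upper - interval, timedelta(0))
    return lower, upper


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 10)
    seen = datetime(2024, 1, 1, 12, 0, 0)
    lower, upper = lag_bounds(now, seen, timedelta(seconds=5))
    print(f"lag is between {lower.total_seconds()}s and {upper.total_seconds()}s")
```

A shorter interval narrows the bracket but writes more WAL, which is the balance described above.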

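PostgreSQL 9.5 may add commit timestamps, which will hopefully let you find out when a given commit happened and therefore how far a replica is behind in wall-clock seconds.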

Answer 2

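Alf162 mentioned a good solution in the comments on Craig Ringer's answer, so I'm adding this to clarify.

PostgreSQL has an administrative function, pg_last_xact_replay_timestamp(), which returns the timestamp of the last transaction replayed during recovery. This is the time at which the commit or abort WAL record for that transaction was generated on the primary.

So this query on the replica, select now()-pg_last_xact_replay_timestamp() as replication_lag, will return a duration representing the difference between the current clock and the timestamp of the last WAL record applied from the replication stream.

Note that if the master isn't receiving new mutations, there will be no WAL records to stream, and the lag calculated this way will keep growing without actually being a signal of replication delay. If the master is under more or less continuous mutation, it will be streaming WAL continuously, and the query above is a good approximation of the time it takes for changes on the master to materialise on the slave. Accuracy will obviously be affected by how closely the system clocks on the two hosts are synchronised.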

Answer 3

The correct answer is slightly different:

postgres=# SELECT
  pg_last_xlog_receive_location() receive,
  pg_last_xlog_replay_location() replay,
  (
   extract(epoch FROM now()) -
   extract(epoch FROM pg_last_xact_replay_timestamp())
  )::int lag;

  receive   |   replay   |  lag  
------------+------------+-------
 1/AB861728 | 1/AB861728 | 2027

The lag only matters when "receive" is not equal to "replay". Run the query on the replica.

Answer 4

If your database has frequent writes, the query below gives a close approximation of the slave lag:

select now() - pg_last_xact_replay_timestamp() AS replication_delay;

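Below is a more accurate query for calculating replication lag on databases with very few writes. If the master doesn't send any writes to the slave, pg_last_xact_replay_timestamp() stays constant, so the query above may not determine the slave lag accurately.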

SELECT CASE
         WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0
         ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
       END AS log_delay;

Answer 5

As of version 10:

https://www.postgresql.org/docs/10/static/monitoring-stats.html#pg-stat-replication-view

write_lag interval Time elapsed between flushing recent WAL locally and receiving notification that this standby server has written it (but not yet flushed it or applied it). This can be used to gauge the delay that synchronous_commit level remote_write incurred while committing if this server was configured as a synchronous standby.

flush_lag interval Time elapsed between flushing recent WAL locally and receiving notification that this standby server has written and flushed it (but not yet applied it). This can be used to gauge the delay that synchronous_commit level remote_flush incurred while committing if this server was configured as a synchronous standby.

replay_lag interval Time elapsed between flushing recent WAL locally and receiving notification that this standby server has written, flushed and applied it. This can be used to gauge the delay that synchronous_commit level remote_apply incurred while committing if this server was configured as a synchronous standby.

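(Formatting mine.)

Alas, the new columns seem to be suited only to synchronous replication (otherwise the master would not know the exact lag), so the check for asynchronous replication lag still seems to be now()-pg_last_xact_replay_timestamp()...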

Answer 6

On the master, you can run select * from pg_stat_replication;
This will give you:

 |  sent_lsn   |  write_lsn  |  flush_lsn  | replay_lsn
-+-------------+-------------+-------------+-------------
 | 8D/2DA48000 | 8D/2DA48000 | 8D/2DA48000 | 89/56A0D500

Those tell you where the offsets are. As you can see from this example, replay on the replica is behind.
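
To put a number on "behind", the two LSNs can be subtracted. The sketch below does this client-side in Python, relying on the fact that an LSN is printed as two hexadecimal halves of a 64-bit byte position; `lsn_to_int` and `lsn_diff` are hypothetical helper names. On the server the same arithmetic is available as pg_wal_lsn_diff() (pg_xlog_location_diff() before version 10).

```python
def lsn_to_int(lsn):
    """Convert a textual LSN such as '8D/2DA48000' to a 64-bit byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the upper 32 bits, the part after it the lower 32.
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff(ahead, behind):
    """Number of WAL bytes between two LSNs (positive when 'ahead' is newer)."""
    return lsn_to_int(ahead) - lsn_to_int(behind)


if __name__ == "__main__":
    # Values taken from the pg_stat_replication row shown above.
    sent = "8D/2DA48000"
    replayed = "89/56A0D500"
    lag_bytes = lsn_diff(sent, replayed)
    print(f"replay is {lag_bytes} bytes ({lag_bytes / 2**30:.1f} GiB) behind sent_lsn")
```

A byte count says how much WAL is still waiting to be replayed, not how long that replay will take.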

Answer 7

For PostgreSQL 10 or later (where pg_last_xlog_receive_location() and the other xlog functions no longer exist), I use this:

SELECT
  pg_is_in_recovery() AS is_slave,
  pg_last_wal_receive_lsn() AS receive,
  pg_last_wal_replay_lsn() AS replay,
  pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() AS synced,
  (
   EXTRACT(EPOCH FROM now()) -
   EXTRACT(EPOCH FROM pg_last_xact_replay_timestamp())
  )::int AS lag;

If you run this query on the master, the result will be:

 is_slave | receive | replay | synced | lag 
----------+---------+--------+--------+-----
 f        |         |        |        |    
(1 row)

If you run this query on a synced slave, the result will look like this:

 is_slave |  receive  |  replay   | synced | lag 
----------+-----------+-----------+--------+-----
 t        | 0/3003128 | 0/3003128 | t      | 214
(1 row)

If you run this query on a NOT synced slave, the result will look like this:

 is_slave |  receive  |  replay   | synced | lag 
----------+-----------+-----------+--------+-----
 t        | 0/30030F0 | 0/30023B0 | f      | 129
(1 row)

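Note: lag (in seconds) has a special meaning here (it is not the same as replay_lag/write_lag/flush_lag in the pg_stat_replication view), and it is only useful when the synced column is false, because lag means how many seconds have elapsed since the last replayed transaction was committed. On a low-traffic site this value is useless. On a high-traffic site, synced could (and will) be false almost all the time, but if the lag value is small enough the server can still be considered synced.

So, to find out whether a server is synced, I check (in this order):

IF is_slave is f (meaning it is not a slave, maybe a master, so it is synced);
IF synced is t (meaning it is a synced slave, so it is synced);
IF (supposing it is applicable) lag <= :threshold: (meaning it is not a synced slave, but it is not too far behind the master, so it is synced enough for me).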
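
The three checks above can be sketched as a small Python function applied to one row of the query result. This is only an illustration: `is_in_sync` and `threshold_seconds` are hypothetical names, and the threshold values in the usage lines are arbitrary. NULL columns (as returned on the master) are represented by None.

```python
def is_in_sync(is_slave, synced, lag_seconds, threshold_seconds):
    """Apply the three checks, in order, to one row of the monitoring query."""
    # 1. Not in recovery: not a slave (probably the master), so nothing to lag behind.
    if not is_slave:
        return True
    # 2. Everything received has been replayed.
    if synced:
        return True
    # 3. Not fully replayed, but the last replayed commit is recent enough.
    return lag_seconds is not None and lag_seconds <= threshold_seconds


if __name__ == "__main__":
    # Rows taken from the three example outputs above.
    print(is_in_sync(False, None, None, threshold_seconds=300))  # master
    print(is_in_sync(True, True, 214, threshold_seconds=300))    # synced slave
    print(is_in_sync(True, False, 129, threshold_seconds=300))   # within threshold
    print(is_in_sync(True, False, 129, threshold_seconds=60))    # beyond threshold
```

Checking synced before lag matters: on a quiet system lag keeps growing even though the slave has replayed everything it received.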

If you want the lag in seconds, including decimals, do:

SELECT
  pg_is_in_recovery() AS is_slave,
  pg_last_wal_receive_lsn() AS receive,
  pg_last_wal_replay_lsn() AS replay,
  pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() AS synced,
  -- EPOCH yields the total elapsed seconds, including the fractional part
  EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())::float AS lag;

Answer 8

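You can use this simple CLI-based open-source tool, which provides real-time visualization of replication lag in various modes, e.g. CLI, web mode, and matplotlib-based charts, for easy tracking.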

Replication-Lag-Visualizer

Feel free to raise any issues or get involved.

postgresql

replication