pgsqld monitor timed out on master when sync slave crashed #218

Open
asphator opened this issue Jan 31, 2023 · 5 comments

@asphator

asphator commented Jan 31, 2023

Hi there

Yesterday, I hit an issue I cannot fully explain (it had occurred only once before, about 2 years ago; at the time, I already couldn't figure out the root cause).
The closest related topic I could find is issue #107, but it has no satisfying explanation.

Our issue

Someone from another SRE team made a mistake and stopped the wrong server through iDRAC - an electrical poweroff - followed by an OS reinstallation. Unfortunately, it was the sync slave of one of our clusters.
The setup should be resilient to the loss of a slave, but here the pgsqld monitor timed out and, as a result, the master instance was restarted by Pacemaker.
It restarted immediately and successfully, with the 2nd slave (the async one) quickly back up to date.
BUT I'm unable to understand or reproduce this behavior. During our tests and benchmarks, the expected behavior occurs: the sync role switches to the 3rd node and the master lives on (after a short freeze).
Of course, it happened during the day, with some load / locks on the databases. We didn't reach max_connections, but active connections were hanging during the minute before the restart.

Our setup

Infrastructure
- 3 datacenters 
- 1 node per datacenter: Master + Slave sync + Slave async
- streaming replication (not using slots)
- paf agent release: 2.3.1
- sbd watchdog
- slaves connect to the master's DNS name (layer 3)
- the DNS record points to a VIP managed with ExaBGP, meaning the same IP is configured on all nodes but announced only when pg_is_in_recovery() is false (a simplified sketch of this check follows the list)
- each node has:
  -  CPU: 32 cores
  -  RAM: 192GB
  -  2x network interfaces: 25Gb
  -  multipath / iSCSI towards a full-flash Kaminario array
- the cluster that faced the issue is a 29TB instance with a hundred databases
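
For clarity, the ExaBGP healthcheck behind the VIP boils down to something like the sketch below. The psql pipeline is the actual check command visible in the syslog further down; the 5s timeout and the "service up/down" branches are simplifications (the real script restores/deletes the loopback IPs and sends the UP/DOWN announces to ExaBGP).

  #!/bin/sh
  # Announce the VIP only when this node is the primary,
  # i.e. when pg_is_in_recovery() returns false.
  if echo "select pg_is_in_recovery();" | timeout 5 psql postgres -At 2>/dev/null | grep -q "^f"; then
      echo "service up"     # real script: restore loopback IPs, send UP announce to ExaBGP
  else
      echo "service down"   # real script: delete loopback IPs, send DOWN announce to ExaBGP
  fi
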
Postgres

As you may guess, node 3 is 2 countries away while nodes 1 and 2 are both around Paris.

synchronous_standby_names:
  -  on srv11014 :   ' 1 ("srv21014", "srv31036") '
  -  on srv21014 :   ' 1 ("srv11014", "srv31036") '
  -  on srv31036 :   ' 1 ("srv11014", "srv21014") '
synchronous_commit: remote_apply
wal_sender_timeout: 60s
wal_receiver_timeout: 60s
wal_receiver_status_interval: 1s
tcp_keepalive_idle = 600s
tcp_keepalive_count = 10
tcp_keepalive_interval = 10
archive_command = '....'  # copying on S3 
restore_command = '....'  # downloading from S3
recovery_target_timeline = 'latest'
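
For the record, this is how we usually check which standby currently holds the sync role; it is a plain pg_stat_replication query on the master, nothing PAF-specific:

  # run on the master: which standby is currently synchronous?
  psql postgres -Atc "SELECT application_name, client_addr, state, sync_state, sync_priority FROM pg_stat_replication;"
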

If you need more, just let me know which ones.

Cluster
Cluster Name: pgclu31
Corosync Nodes:
 srv11014.xxxxxxxxxxxx  srv21014.xxxxxxxxxxxx  srv31036.xxxxxxxxxxxx 
Pacemaker Nodes:
 srv11014.xxxxxxxxxxxx  srv21014.xxxxxxxxxxxx  srv31036.xxxxxxxxxxxx 

Resources:
 Master: pgsql-ha
  Meta Attrs: notify=true 
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/pgsql-9.6/bin maxlag=0 pgdata=/var/lib/pgsql/9.6/data/
   Operations: demote interval=0s timeout=240s (pgsqld-demote-interval-0s)
               monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
               promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
               start interval=0s timeout=300s (pgsqld-start-interval-0s)
               stop interval=0s timeout=120s (pgsqld-stop-interval-0s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:

Alerts:
 Alert: sensu_alert (path=/foo/bar/check_hacluster)
  Recipients:
   Recipient: sensu_alert-recipient (value=PDINTKEY)

Resources Defaults:
 migration-threshold: 5
 resource-stickiness: 10
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: pgclu31
 dc-version: 1.1.16-12.el7-94ff4df
 have-watchdog: true
 last-lrm-refresh: 1675102250
 maintenance-mode: false
 no-quorum-policy: suicide
 stonith-enabled: true
 stonith-watchdog-timeout: 10s
Node Attributes:
 srv11014.xxxxxxxxxxxx : master-pgsqld=1001
 srv21014.xxxxxxxxxxxx : master-pgsqld=1000
 srv31036.xxxxxxxxxxxx : master-pgsqld=990

Quorum:
  Options:
Nominal state
Cluster name: pgclu31
Stack: corosync
Current DC: srv31036.xxxxxxxxxxxx  (3) (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Tue Jan 31 15:34:16 2023
Last change: Tue Jan 31 14:54:45 2023 by root via crm_attribute on srv11014.xxxxxxxxxxxx 

3 nodes configured
3 resources configured

Online: [ srv11014.xxxxxxxxxxxx  (1) srv21014.xxxxxxxxxxxx  (2) srv31036.xxxxxxxxxxxx  (3) ]

Full list of resources:

 Master/Slave Set: pgsql-ha [pgsqld]
     pgsqld	(ocf::heartbeat:pgsqlms):	Master srv11014.xxxxxxxxxxxx 
     pgsqld	(ocf::heartbeat:pgsqlms):	Slave srv21014.xxxxxxxxxxxx 
     pgsqld	(ocf::heartbeat:pgsqlms):	Slave srv31036.xxxxxxxxxxxx 
     Masters: [ srv11014.xxxxxxxxxxxx  ]
     Slaves: [ srv21014.xxxxxxxxxxxx  srv31036.xxxxxxxxxxxx  ]

Node Attributes:
* Node srv11014.xxxxxxxxxxxx  (1):
    + master-pgsqld                   	: 1001      
* Node srv21014.xxxxxxxxxxxx  (2):
    + master-pgsqld                   	: 1000      
* Node srv31036.xxxxxxxxxxxx  (3):
    + master-pgsqld                   	: 990       

Migration Summary:
* Node srv11014.xxxxxxxxxxxx  (1):
* Node srv21014.xxxxxxxxxxxx  (2):
* Node srv31036.xxxxxxxxxxxx  (3):

PCSD Status:
  srv11014.xxxxxxxxxxxx : Online
  srv21014.xxxxxxxxxxxx : Online
  srv31036.xxxxxxxxxxxx : Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
  sbd: active/enabled

Timeline

  • around 13:45:52: srv21014 is brutally lost
  • 13:45:52: srv11014 (master) is frozen
  • 13:45:52: active connections increasing and hanging
  • 13:46:17: srv21014 is declared lost by Corosync/Pacemaker
  • 13:46:17: monitor operation for pgsqld on srv11014 timed out
  • 13:46:18: srv11014 received fast shutdown request, aborting active transactions
  • 13:46:28: srv11014 is shutting down
  • 13:46:54: walsender process for srv21014 times out (on srv11014)
  • 13:47:31: walsender process for srv31036 times out (on srv11014)
  • 13:47:39: srv11014 is shut down
  • 13:47:41: srv11014 is starting up (as slave)
  • 13:47:42: srv11014 is started
  • 13:47:43: srv11014 is being promoted
  • 13:47:46: promotion completed

Some logs

Syslog master (srv11014)
Jan 30 13:45:52  <=== node srv21014.xxxxxxxxxxxx has been stopped electrically around here (+/- 30s)
...
Jan 30 13:46:02 srv11014 sensu-agent: {"component":"command","error":"operation not permitted","level":"error","msg":"Execution timed out - Unable to TERM/KILL the process: #66168","time":"2023-01-30T13:46:02Z"}
Jan 30 13:46:02 srv11014 sensu-agent: {"check":"check-hacluster-status","component":"command","level":"warning","msg":"check or hook execution created zombie process - escaping in order for the check to execute again","time":"2023-
Jan 30 13:46:04 srv11014 corosync[15236]: [TOTEM ] A processor failed, forming new configuration.
Jan 30 13:46:07 srv11014 pgsqlms(pgsqld)[69063]: INFO: Update score of "srv31036.xxxxxxxxxxxx" from 990 to 1000 because of a change in the replication lag (0).
Jan 30 13:46:13 srv11014 healthcheck-pgclu31.xxxxxxxxxxxx[173794]: service up, restoring loopback ips
Jan 30 13:46:13 srv11014 healthcheck-pgclu31.xxxxxxxxxxxx[173794]: send announces for UP state to ExaBGP
Jan 30 13:46:17 srv11014 corosync[15236]: [TOTEM ] A new membership (x.x.x.x:x) was formed. Members left: 2
Jan 30 13:46:17 srv11014 corosync[15236]: [TOTEM ] Failed to receive the leave message. failed: 2
Jan 30 13:46:17 srv11014 cib[15270]:  notice: Node srv21014.xxxxxxxxxxxx state is now lost
Jan 30 13:46:17 srv11014 cib[15270]:  notice: Purged 1 peers with id=2 and/or uname=srv21014.xxxxxxxxxxxx from the membership cache
Jan 30 13:46:17 srv11014 stonith-ng[15271]:  notice: Node srv21014.xxxxxxxxxxxx state is now lost
Jan 30 13:46:17 srv11014 crmd[15275]:  notice: Our peer on the DC (srv21014.xxxxxxxxxxxx) is dead
Jan 30 13:46:17 srv11014 crmd[15275]:  notice: State transition S_NOT_DC -> S_ELECTION
Jan 30 13:46:17 srv11014 attrd[15273]:  notice: Node srv21014.xxxxxxxxxxxx state is now lost
Jan 30 13:46:17 srv11014 stonith-ng[15271]:  notice: Purged 1 peers with id=2 and/or uname=srv21014.xxxxxxxxxxxx from the membership cache
Jan 30 13:46:17 srv11014 attrd[15273]:  notice: Removing all srv21014.xxxxxxxxxxxx attributes for peer loss
Jan 30 13:46:17 srv11014 attrd[15273]:  notice: Lost attribute writer srv21014.xxxxxxxxxxxx
Jan 30 13:46:17 srv11014 attrd[15273]:  notice: Purged 1 peers with id=2 and/or uname=srv21014.xxxxxxxxxxxx from the membership cache
Jan 30 13:46:17 srv11014 corosync[15236]: [QUORUM] Members[2]: 1 3
Jan 30 13:46:17 srv11014 corosync[15236]: [MAIN  ] Completed service synchronization, ready to provide service.
Jan 30 13:46:17 srv11014 crmd[15275]:  notice: Node srv21014.xxxxxxxxxxxx state is now lost
Jan 30 13:46:17 srv11014 pacemakerd[15266]:  notice: Node srv21014.xxxxxxxxxxxx state is now lost
Jan 30 13:46:17 srv11014 pgsqlms(pgsqld)[69063]: INFO: Update score of "srv21014.xxxxxxxxxxxx" from 1000 to 990 because of a change in the replication lag (16617424).
Jan 30 13:46:17 srv11014 lrmd[15272]: warning: pgsqld_monitor_15000 process (PID 69063) timed out
Jan 30 13:46:17 srv11014 lrmd[15272]: warning: pgsqld_monitor_15000:69063 - timed out after 10000ms
Jan 30 13:46:17 srv11014 crmd[15275]:   error: Result of monitor operation for pgsqld on srv11014.xxxxxxxxxxxx: Timed Out
Jan 30 13:46:17 srv11014 crmd[15275]:  notice: srv11014.xxxxxxxxxxxx-pgsqld_monitor_15000:68 [ /tmp:5432 - accepting connections\n ]
Jan 30 13:46:17 srv11014 crmd[15275]:  notice: State transition S_ELECTION -> S_PENDING
Jan 30 13:46:17 srv11014 crmd[15275]:  notice: State transition S_PENDING -> S_NOT_DC
Jan 30 13:46:28 srv11014 stonith-ng[15271]:  notice: Operation reboot of srv21014.xxxxxxxxxxxx by srv31036.xxxxxxxxxxxx for [email protected]: OK
Jan 30 13:46:28 srv11014 crmd[15275]:  notice: Peer srv21014.xxxxxxxxxxxx was terminated (reboot) by srv31036.xxxxxxxxxxxx for srv31036.xxxxxxxxxxxx: OK (ref=837edffe-8e3d-4fdf-87ed-d912c8e17885) by client crmd.224813
Jan 30 13:46:28 srv11014 crmd[15275]:  notice: srv11014.xxxxxxxxxxxx-pgsqld_monitor_15000:68 [ /tmp:5432 - accepting connections\n ]
Jan 30 13:46:28 srv11014 postgres[250882]: [35-1] <  >LOG:  received fast shutdown request
Jan 30 13:46:28 srv11014 postgres[250882]: [36-1] <  >LOG:  aborting any active transactions
Jan 30 13:46:28 srv11014 postgres[147232]: [42-1] <x.x.x.x db02 dbuser>FATAL:  terminating connection due to administrator command
Jan 30 13:46:28 srv11014 postgres[207321]: [37-1] <x.x.x.x db14 dbuser>FATAL:  terminating connection due to administrator command
Jan 30 13:46:28 srv11014 postgres[200435]: [38-1] <x.x.x.x db02 dbuser>FATAL:  terminating connection due to administrator command
Jan 30 13:46:28 srv11014 postgres[252313]: [35-1] <x.x.x.x db11 dbuser>FATAL:  terminating connection due to administrator command
Jan 30 13:46:28 srv11014 postgres[241475]: [35-1] <x.x.x.x db54 dbuser>FATAL:  terminating connection due to administrator command
Jan 30 13:46:28 srv11014 postgres[238737]: [36-1] <x.x.x.y db16 dbuser>LOG:  duration: 32585.241 ms  statement: commit
Jan 30 13:46:28 srv11014 postgres[238737]: [37-1] <x.x.x.y db16 dbuser>FATAL:  terminating connection due to administrator command
Jan 30 13:46:28 srv11014 postgres[238737]: [37-2] <x.x.x.y db16 dbuser>STATEMENT:  commit
Jan 30 13:46:28 srv11014 postgres[137615]: [9116-1] <x.x.x.y db47 dbuser>LOG:  duration: 33470.409 ms  statement: commit
Jan 30 13:46:28 srv11014 postgres[137939]: [76141-1] <x.x.x.y db30 dbuser>LOG:  duration: 33221.234 ms  statement: commit
Jan 30 13:46:28 srv11014 postgres[137939]: [76142-1] <x.x.x.y db30 dbuser>FATAL:  terminating connection due to administrator command
Jan 30 13:46:28 srv11014 postgres[137615]: [9117-1] <x.x.x.y db47 dbuser>FATAL:  terminating connection due to administrator command
Jan 30 13:46:28 srv11014 postgres[137939]: [76142-2] <x.x.x.y db30 dbuser>STATEMENT:  commit
Jan 30 13:46:28 srv11014 postgres[137615]: [9117-2] <x.x.x.y db47 dbuser>STATEMENT:  commit
Jan 30 13:46:28 srv11014 postgres[141976]: [141656-1] <x.x.x.y db31 dbuser>LOG:  duration: 34055.413 ms  statement: commit
[...]
Jan 30 13:46:28 srv11014 postgres[72888]: [37-1] <[local] db59 dbuser>FATAL:  the database system is shutting down
Jan 30 13:46:28 srv11014 postgres[72899]: [37-1] <[local] db14 dbuser>FATAL:  the database system is shutting down
Jan 30 13:46:28 srv11014 postgres[72915]: [37-1] <x.x.x.z db15 dbuser>FATAL:  the database system is shutting down
[...]
Jan 30 13:46:54 srv11014 postgres[27140]: [20-1] <srv21014.xxxxxxxxxxxx [unknown] pgrepli>LOG:  terminating walsender process due to replication timeout
Jan 30 13:46:54 srv11014 postgres[251164]: [148691-1] <  >LOG:  checkpoint starting: shutdown immediate
Jan 30 13:46:54 srv11014 postgres[251164]: [148692-1] <  >LOG:  checkpoint complete: wrote 13685 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.349 s, sync=0.013 s, total=0.372 s; sync files=795, longest=0.006 s, average=0.001 s; distance=120896 kB, estimate=1263968 kB
Jan 30 13:46:55 srv11014 postgres[74131]: [37-1] <[local] postgres dbuser>FATAL:  the database system is shutting down
Jan 30 13:46:56 srv11014 postgres[74203]: [37-1] <[local] postgres exabgp>FATAL:  the database system is shutting down
[...]
Jan 30 13:47:28 srv11014 postgres[75946]: [37-1] <[local] postgres dbuser>FATAL:  the database system is shutting down
Jan 30 13:47:31 srv11014 postgres[13354]: [21-1] <srv31036.xxxxxxxxxxxx [unknown] pgrepli>LOG:  terminating walsender process due to replication timeout
Jan 30 13:47:36 srv11014 healthcheck-pgclu31.xxxxxxxxxxxx[173794]: Timeout (5) while running check command /bin/echo "select pg_is_in_recovery( );" | /usr/bin/psql postgres -At 2>/dev/null | /bin/grep -q "^f"
Jan 30 13:47:36 srv11014 healthcheck-pgclu31.xxxxxxxxxxxx[173794]: service down, deleting loopback ips
Jan 30 13:47:36 srv11014 healthcheck-pgclu31.xxxxxxxxxxxx[173794]: send announces for DOWN state to ExaBGP
Jan 30 13:47:39 srv11014 postgres[250882]: [37-1] <  >LOG:  database system is shut down
Jan 30 13:47:40 srv11014 pgsqlms(pgsqld)[72848]: INFO: Instance "pgsqld" stopped
Jan 30 13:47:40 srv11014 crmd[15275]:  notice: Result of stop operation for pgsqld on srv11014.xxxxxxxxxxxx: 0 (ok)
Jan 30 13:47:41 srv11014 postgres[76460]: [1-1] <  >LOG:  redirecting log output to logging collector process
Jan 30 13:47:41 srv11014 postgres[76460]: [1-2] <  >HINT:  Future log output will appear in directory "pg_log".
Jan 30 13:47:41 srv11014 postgres[76460]: [2-1] <  >LOG:  ending log output to stderr
Jan 30 13:47:41 srv11014 postgres[76460]: [2-2] <  >HINT:  Future log output will go to log destination "syslog".
Jan 30 13:47:41 srv11014 postgres[76493]: [3-1] <[local] postgres exabgp>FATAL:  the database system is starting up
Jan 30 13:47:41 srv11014 postgres[76492]: [3-1] <  >LOG:  database system was shut down at 2023-01-30 13:46:54 GMT
Jan 30 13:47:41 srv11014 postgres[76512]: [3-1] <[local] postgres postgres>FATAL:  the database system is starting up
Jan 30 13:47:42 srv11014 postgres[76492]: [4-1] <  >LOG:  entering standby mode
Jan 30 13:47:42 srv11014 postgres[76586]: [3-1] <[local] postgres dbuser>FATAL:  the database system is starting up
Jan 30 13:47:42 srv11014 postgres[76492]: [5-1] <  >LOG:  consistent recovery state reached at 65D4E/2C000098
Jan 30 13:47:42 srv11014 postgres[76492]: [6-1] <  >LOG:  invalid record length at 65D4E/2C000098: wanted 24, got 0
Jan 30 13:47:42 srv11014 postgres[76460]: [3-1] <  >LOG:  database system is ready to accept read only connections
Jan 30 13:47:42 srv11014 pgsqlms(pgsqld)[76428]: INFO: Instance "pgsqld" started
Jan 30 13:47:43 srv11014 crmd[15275]:  notice: Result of start operation for pgsqld on srv11014.xxxxxxxxxxxx: 0 (ok)
Jan 30 13:47:43 srv11014 crmd[15275]:  notice: Result of notify operation for pgsqld on srv11014.xxxxxxxxxxxx: 0 (ok)
Jan 30 13:47:43 srv11014 pgsqlms(pgsqld)[76668]: INFO: Promoting instance on node "srv11014.xxxxxxxxxxxx"
Jan 30 13:47:43 srv11014 pgsqlms(pgsqld)[76668]: INFO: Current node TL#LSN: 28#1791440187293696
Jan 30 13:47:43 srv11014 crmd[15275]:  notice: Result of notify operation for pgsqld on srv11014.xxxxxxxxxxxx: 0 (ok)
Jan 30 13:47:44 srv11014 postgres[76492]: [7-1] <  >LOG:  received promote request
Jan 30 13:47:44 srv11014 postgres[76592]: [4-1] <  >FATAL:  terminating walreceiver process due to administrator command
Jan 30 13:47:44 srv11014 pgsqlms(pgsqld)[76758]: INFO: Waiting for the promote to complete
Jan 30 13:47:44 srv11014 postgres[76492]: [8-1] <  >LOG:  redo is not required
Jan 30 13:47:45 srv11014 postgres[76492]: [9-1] <  >LOG:  selected new timeline ID: 29
Jan 30 13:47:45 srv11014 postgres[76492]: [10-1] <  >LOG:  archive recovery complete
Jan 30 13:47:45 srv11014 pgsqlms(pgsqld)[76758]: INFO: Waiting for the promote to complete
Jan 30 13:47:45 srv11014 postgres[76492]: [11-1] <  >LOG:  MultiXact member wraparound protections are now enabled
Jan 30 13:47:45 srv11014 postgres[76589]: [3-1] <  >LOG:  checkpoint starting: force
Jan 30 13:47:45 srv11014 postgres[76460]: [4-1] <  >LOG:  database system is ready to accept connections
Jan 30 13:47:45 srv11014 postgres[76955]: [4-1] <  >LOG:  autovacuum launcher started
Jan 30 13:47:45 srv11014 postgres[76589]: [4-1] <  >LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.225 s, sync=0.001 s, total=0.242 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB
Jan 30 13:47:46 srv11014 pgsqlms(pgsqld)[76758]: INFO: Promote complete
Jan 30 13:47:46 srv11014 crmd[15275]:  notice: Result of promote operation for pgsqld on srv11014.xxxxxxxxxxxx: 0 (ok)
Jan 30 13:47:46 srv11014 crmd[15275]:  notice: Result of notify operation for pgsqld on srv11014.xxxxxxxxxxxx: 0 (ok)
Jan 30 13:47:46 srv11014 pgsqlms(pgsqld)[77126]: WARNING: No secondary connected to the primary
Jan 30 13:47:46 srv11014 pgsqlms(pgsqld)[77126]: WARNING: "srv31036.xxxxxxxxxxxx" is not connected to the primary
Jan 30 13:47:47 srv11014 crmd[15275]:  notice: srv11014.xxxxxxxxxxxx-pgsqld_monitor_15000:77 [ /tmp:5432 - accepting connections\n ]
Jan 30 13:47:48 srv11014 healthcheck-pgclu31.services.xxxxxxxxxxxx [173794]: service up, restoring loopback ips
Jan 30 13:47:49 srv11014 healthcheck-pgclu31.services.xxxxxxxxxxxx [173794]: send announces for UP state to ExaBGP
[...]
Jan 30 13:47:55 srv11014 postgres[79389]: [5-1] <srv31036.xxxxxxxxxxxx [unknown] pgrepli>LOG:  standby "srv31036.xxxxxxxxxxxx" is now a synchronous standby with priority 2
[...]
Jan 30 13:48:02 srv11014 pgsqlms(pgsqld)[80212]: INFO: Update score of "srv31036.xxxxxxxxxxxx" from -1000 to 1000 because of a change in the replication lag (104680).
Syslog slave async (srv31036)
Jan 30 13:46:04 srv31036 corosync[224776]: [TOTEM ] A processor failed, forming new configuration.
Jan 30 13:46:06 srv31036 healthcheck-pgclu31.xxxxxxxxxxxx[127933]: service down, deleting loopback ips
Jan 30 13:46:06 srv31036 healthcheck-pgclu31.xxxxxxxxxxxx[127933]: send announces for DOWN state to ExaBGP
Jan 30 13:46:12 srv31036 sensu-agent: {"component":"command","error":"operation not permitted","level":"error","msg":"Execution timed out - Unable to TERM/KILL the process: #149768","time":"2023-01-30T13:46:12Z"}
Jan 30 13:46:12 srv31036 sensu-agent: {"check":"metric-hacluster","component":"command","level":"warning","msg":"check or hook execution created zombie process - escaping in order for the check to execute again","time":"2023-01-30T13:46:12Z"}
Jan 30 13:46:17 srv31036 corosync[224776]: [TOTEM ] A new membership (x.x.x.x:xxx) was formed. Members left: 2
Jan 30 13:46:17 srv31036 corosync[224776]: [TOTEM ] Failed to receive the leave message. failed: 2
Jan 30 13:46:17 srv31036 cib[224808]:  notice: Node srv21014.xxxxxxxxxxxx state is now lost
Jan 30 13:46:17 srv31036 cib[224808]:  notice: Purged 1 peers with id=2 and/or uname=srv21014.xxxxxxxxxxxx from the membership cache
Jan 30 13:46:17 srv31036 stonith-ng[224809]:  notice: Node srv21014.xxxxxxxxxxxx state is now lost
Jan 30 13:46:17 srv31036 crmd[224813]:  notice: Our peer on the DC (srv21014.xxxxxxxxxxxx) is dead
Jan 30 13:46:17 srv31036 crmd[224813]:  notice: State transition S_NOT_DC -> S_ELECTION
Jan 30 13:46:17 srv31036 stonith-ng[224809]:  notice: Purged 1 peers with id=2 and/or uname=srv21014.xxxxxxxxxxxx from the membership cache
Jan 30 13:46:17 srv31036 attrd[224811]:  notice: Node srv21014.xxxxxxxxxxxx state is now lost
Jan 30 13:46:17 srv31036 attrd[224811]:  notice: Removing all srv21014.xxxxxxxxxxxx attributes for peer loss
Jan 30 13:46:17 srv31036 attrd[224811]:  notice: Lost attribute writer srv21014.xxxxxxxxxxxx
Jan 30 13:46:17 srv31036 attrd[224811]:  notice: Purged 1 peers with id=2 and/or uname=srv21014.xxxxxxxxxxxx from the membership cache
Jan 30 13:46:17 srv31036 corosync[224776]: [QUORUM] Members[2]: 1 3
Jan 30 13:46:17 srv31036 crmd[224813]:  notice: Node srv21014.xxxxxxxxxxxx state is now lost
Jan 30 13:46:17 srv31036 corosync[224776]: [MAIN  ] Completed service synchronization, ready to provide service.
Jan 30 13:46:17 srv31036 pacemakerd[224804]:  notice: Node srv21014.xxxxxxxxxxxx state is now lost
Jan 30 13:46:17 srv31036 crmd[224813]:  notice: State transition S_ELECTION -> S_INTEGRATION
Jan 30 13:46:17 srv31036 crmd[224813]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from do_election_check
Jan 30 13:46:18 srv31036 pengine[224812]:  notice: Watchdog will be used via SBD if fencing is required
Jan 30 13:46:18 srv31036 pengine[224812]: warning: Node srv21014.xxxxxxxxxxxx will be fenced because the node is no longer part of the cluster
Jan 30 13:46:18 srv31036 pengine[224812]: warning: Node srv21014.xxxxxxxxxxxx is unclean
Jan 30 13:46:18 srv31036 pengine[224812]: warning: Processing failed op monitor for pgsqld:0 on srv11014.xxxxxxxxxxxx: unknown error (1)
Jan 30 13:46:18 srv31036 pengine[224812]: warning: Action pgsqld:1_stop_0 on srv21014.xxxxxxxxxxxx is unrunnable (offline)
Jan 30 13:46:18 srv31036 pengine[224812]: warning: Action pgsqld:1_stop_0 on srv21014.xxxxxxxxxxxx is unrunnable (offline)
Jan 30 13:46:18 srv31036 pengine[224812]: warning: Action pgsqld:1_stop_0 on srv21014.xxxxxxxxxxxx is unrunnable (offline)
Jan 30 13:46:18 srv31036 pengine[224812]: warning: Action pgsqld:1_stop_0 on srv21014.xxxxxxxxxxxx is unrunnable (offline)
Jan 30 13:46:18 srv31036 pengine[224812]: warning: Scheduling Node srv21014.xxxxxxxxxxxx for STONITH
Jan 30 13:46:18 srv31036 pengine[224812]:  notice: Recover pgsqld:0#011(Slave srv11014.xxxxxxxxxxxx)
Jan 30 13:46:18 srv31036 pengine[224812]:  notice: Promote pgsqld:0#011(Slave -> Master srv11014.xxxxxxxxxxxx)
Jan 30 13:46:18 srv31036 pengine[224812]:  notice: Stop    pgsqld:1#011(srv21014.xxxxxxxxxxxx)
Jan 30 13:46:18 srv31036 pengine[224812]: warning: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-8.bz2
Jan 30 13:46:18 srv31036 crmd[224813]:  notice: Requesting fencing (reboot) of node srv21014.xxxxxxxxxxxx
Jan 30 13:46:18 srv31036 crmd[224813]:  notice: Initiating notify operation pgsqld_pre_notify_stop_0 on srv11014.xxxxxxxxxxxx
Jan 30 13:46:18 srv31036 crmd[224813]:  notice: Initiating notify operation pgsqld_pre_notify_stop_0 locally on srv31036.xxxxxxxxxxxx
Jan 30 13:46:18 srv31036 stonith-ng[224809]:  notice: Client crmd.224813.eb497a35 wants to fence (reboot) 'srv21014.xxxxxxxxxxxx' with device '(any)'
Jan 30 13:46:18 srv31036 stonith-ng[224809]:  notice: Requesting peer fencing (reboot) of srv21014.xxxxxxxxxxxx
Jan 30 13:46:18 srv31036 stonith-ng[224809]:  notice: Couldn't find anyone to fence (reboot) srv21014.xxxxxxxxxxxx with any device
Jan 30 13:46:18 srv31036 stonith-ng[224809]:  notice: Waiting 10s for srv21014.xxxxxxxxxxxx to self-fence (reboot) for crmd.224813.837edffe ((nil))
Jan 30 13:46:18 srv31036 crmd[224813]:  notice: Result of notify operation for pgsqld on srv31036.xxxxxxxxxxxx: 0 (ok)
Jan 30 13:46:28 srv31036 stonith-ng[224809]:  notice: Self-fencing (reboot) by srv21014.xxxxxxxxxxxx for crmd.224813.837edffe-8e3d-4fdf-87ed-d912c8e17885 assumed complete
Jan 30 13:46:28 srv31036 stonith-ng[224809]:  notice: Operation reboot of srv21014.xxxxxxxxxxxx by srv31036.xxxxxxxxxxxx for [email protected]: OK
Jan 30 13:46:28 srv31036 crmd[224813]:  notice: Stonith operation 2/35:0:0:8f44efb5-0fb8-4cb2-98c6-05df430f5543: OK (0)
Jan 30 13:46:28 srv31036 crmd[224813]:  notice: Peer srv21014.xxxxxxxxxxxx was terminated (reboot) by srv31036.xxxxxxxxxxxx for srv31036.xxxxxxxxxxxx: OK (ref=837edffe-8e3d-4fdf-87ed-d912c8e17885) by client crmd.224813
Jan 30 13:46:28 srv31036 crmd[224813]:  notice: Initiating stop operation pgsqld_stop_0 on srv11014.xxxxxxxxxxxx
[...]
Jan 30 13:47:15 srv31036 postgres[225351]: [414523-1] <  >LOG:  restartpoint complete: wrote 248212 buffers (3.8%); 0 transaction log file(s) added, 0 removed, 13 recycled; write=107.505 s, sync=0.028 s, total=107.569 s; sync files=2314, longest=0.003 s, average=0.001 s; distance=1033026 kB, estimate=1454494 kB
Jan 30 13:47:15 srv31036 postgres[225351]: [414524-1] <  >LOG:  recovery restart point at 65D4D/F29D3ED0
Jan 30 13:47:15 srv31036 postgres[225351]: [414524-2] <  >DETAIL:  last completed transaction was at log time 2023-01-30 13:46:22.995287+00
[...]
Jan 30 13:47:23 srv31036 postgres[252372]: [27-1] <  >FATAL:  terminating walreceiver due to timeout
Jan 30 13:47:23 srv31036 postgres[225224]: [1696-1] <  >LOG:  restored log file "0000001C00065D4E0000002B" from archive
[...]
Jan 30 13:47:40 srv31036 crmd[224813]:  notice: Initiating notify operation pgsqld_post_notify_stop_0 locally on srv31036.xxxxxxxxxxxx
Jan 30 13:47:40 srv31036 crmd[224813]:  notice: Result of notify operation for pgsqld on srv31036.xxxxxxxxxxxx: 0 (ok)
Jan 30 13:47:40 srv31036 crmd[224813]:  notice: Initiating notify operation pgsqld_pre_notify_start_0 locally on srv31036.xxxxxxxxxxxx
Jan 30 13:47:40 srv31036 crmd[224813]:  notice: Result of notify operation for pgsqld on srv31036.xxxxxxxxxxxx: 0 (ok)
Jan 30 13:47:40 srv31036 crmd[224813]:  notice: Initiating start operation pgsqld_start_0 on srv11014.dxxxxxxxxxxxx
Jan 30 13:47:41 srv31036 healthcheck-pgclu31.services.xxxxxxxxxxxx [127933]: Check command was unsuccessful: 1
Jan 30 13:47:41 srv31036 healthcheck-pgclu31.services.xxxxxxxxxxxx [127933]: service down, deleting loopback ips
Jan 30 13:47:41 srv31036 healthcheck-pgclu31.services.xxxxxxxxxxxx [127933]: send announces for DOWN state to ExaBGP
Jan 30 13:47:43 srv31036 crmd[224813]:  notice: Initiating notify operation pgsqld_post_notify_start_0 on srv11014.xxxxxxxxxxxx
Jan 30 13:47:43 srv31036 crmd[224813]:  notice: Initiating notify operation pgsqld_post_notify_start_0 locally on srv31036.xxxxxxxxxxxx
Jan 30 13:47:43 srv31036 crmd[224813]:  notice: Result of notify operation for pgsqld on srv31036.xxxxxxxxxxxx: 0 (ok)
Jan 30 13:47:43 srv31036 crmd[224813]:  notice: Initiating notify operation pgsqld_pre_notify_promote_0 on srv11014.xxxxxxxxxxxx
Jan 30 13:47:43 srv31036 crmd[224813]:  notice: Initiating notify operation pgsqld_pre_notify_promote_0 locally on srv31036.xxxxxxxxxxxx
Jan 30 13:47:43 srv31036 pgsqlms(pgsqld)[167930]: INFO: Promoting instance on node "srv11014.xxxxxxxxxxxx"
Jan 30 13:47:43 srv31036 postgres[225351]: [414526-1] <  >LOG:  restartpoint complete: wrote 19948 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=16.638 s, sync=0.010 s, total=16.656 s; sync files=915, longest=0.005 s, average=0.001 s; distance=819311 kB, estimate=1390976 kB
Jan 30 13:47:43 srv31036 postgres[225351]: [414527-1] <  >LOG:  recovery restart point at 65D4E/249EFCD8
Jan 30 13:47:43 srv31036 postgres[225351]: [414527-2] <  >DETAIL:  last completed transaction was at log time 2023-01-30 13:46:22.995287+00
Jan 30 13:47:43 srv31036 pgsqlms(pgsqld)[167930]: INFO: Current node TL#LSN: 28#1791440183113144
Jan 30 13:47:43 srv31036 crmd[224813]:  notice: Result of notify operation for pgsqld on srv31036.xxxxxxxxxxxx: 0 (ok)
Jan 30 13:47:43 srv31036 crmd[224813]:  notice: Initiating promote operation pgsqld_promote_0 on srv11014.xxxxxxxxxxxx
Jan 30 13:47:46 srv31036 crmd[224813]:  notice: Initiating notify operation pgsqld_post_notify_promote_0 on srv11014.xxxxxxxxxxxx
Jan 30 13:47:46 srv31036 crmd[224813]:  notice: Initiating notify operation pgsqld_post_notify_promote_0 locally on srv31036.xxxxxxxxxxxx
Jan 30 13:47:46 srv31036 crmd[224813]:  notice: Result of notify operation for pgsqld on srv31036.xxxxxxxxxxxx: 0 (ok)
Jan 30 13:47:46 srv31036 crmd[224813]:  notice: Initiating monitor operation pgsqld_monitor_15000 on srv11014.xxxxxxxxxxxx
Jan 30 13:47:46 srv31036 crmd[224813]:  notice: Transition aborted by nodes-3-master-pgsqld doing modify master-pgsqld=-1000: Configuration change
Jan 30 13:47:47 srv31036 crmd[224813]:  notice: Transition 0 (Complete=38, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-8.bz2): Complete
Jan 30 13:47:47 srv31036 pengine[224812]:  notice: Watchdog will be used via SBD if fencing is required
Jan 30 13:47:47 srv31036 pengine[224812]: warning: Processing failed op monitor for pgsqld:0 on srv11014.xxxxxxxxxxxx: unknown error (1)
Jan 30 13:47:47 srv31036 pengine[224812]:  notice: Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-3600.bz2
Jan 30 13:47:47 srv31036 crmd[224813]:  notice: Transition 1 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3600.bz2): Complete
Jan 30 13:47:47 srv31036 crmd[224813]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Jan 30 13:47:55 srv31036 postgres[164894]: [37-1] <  >LOG:  fetching timeline history file for timeline 29 from primary server
Jan 30 13:47:55 srv31036 postgres[164894]: [38-1] <  >LOG:  started streaming WAL from primary at 65D4E/2C000000 on timeline 28
Jan 30 13:47:55 srv31036 postgres[164894]: [39-1] <  >LOG:  replication terminated by primary server
Jan 30 13:47:55 srv31036 postgres[164894]: [39-2] <  >DETAIL:  End of WAL reached on timeline 28 at 65D4E/2C000098.
Jan 30 13:47:55 srv31036 postgres[225224]: [1697-1] <  >LOG:  restored log file "0000001D.history" from archive
Jan 30 13:47:55 srv31036 postgres[225224]: [1698-1] <  >LOG:  restored log file "0000001D.history" from archive
Jan 30 13:47:55 srv31036 postgres[225224]: [1699-1] <  >LOG:  new target timeline is 29
Jan 30 13:47:56 srv31036 postgres[225224]: [1700-1] <  >LOG:  invalid resource manager ID 24 at 65D4E/2C000098
Jan 30 13:47:56 srv31036 postgres[164894]: [40-1] <  >LOG:  restarted WAL streaming at 65D4E/2C000000 on timeline 29
Jan 30 13:48:02 srv31036 crmd[224813]:  notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 30 13:48:02 srv31036 pengine[224812]:  notice: Watchdog will be used via SBD if fencing is required
Jan 30 13:48:02 srv31036 pengine[224812]: warning: Processing failed op monitor for pgsqld:0 on srv11014.xxxxxxxxxxxx: unknown error (1)
Jan 30 13:48:02 srv31036 crmd[224813]:  notice: Transition 2 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3601.bz2): Complete
Jan 30 13:48:02 srv31036 pengine[224812]:  notice: Calculated transition 2, saving inputs in /var/lib/pacemaker/pengine/pe-input-3601.bz2
Jan 30 13:48:02 srv31036 crmd[224813]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Syslog slave Sync (srv21014)
Well... none ^^

I can provide more logs, just let me know which ones you need.

My questions

  1. Related to the root cause
    Could you explain why the master's pgsqld monitor timed out?
    Or could you suggest what to monitor or trace more deeply (if possible)?
    We have a bunch of dashboards, maybe some intel is in there but I haven't correlated it; feel free to ask, if we have it, I'll share :)
  2. Related to preventing it in the future
    I'm considering tuning the walsender/walreceiver timeouts down to 9s (1s lower than the pcs resource monitor timeout), as sketched below. Do you think it could safely avoid such behavior, or would I face side effects/risks I may have missed? In this use case, I would expect the sync role to move to srv31036 around 13:46:16.
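
For reference, the change I have in mind would look like this on every node (just a sketch, assuming ALTER SYSTEM is acceptable in our setup; both GUCs are reloadable, so no restart should be needed):

  # considered change, to be applied on every node
  psql postgres -c "ALTER SYSTEM SET wal_sender_timeout = '9s';"
  psql postgres -c "ALTER SYSTEM SET wal_receiver_timeout = '9s';"
  psql postgres -c "SELECT pg_reload_conf();"
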

Kind Regards
Arpad

asphator changed the title from "pgqld monitor timed out on master when sync slave crashed" to "pgsqld monitor timed out on master when sync slave crashed" on Feb 1, 2023
@ioguix
Member

ioguix commented Feb 7, 2023

Hi @asphator,

This is definitely something I'm interested in investigating. Unfortunately, I'm currently short on time...

I hope I'll be able to parse all your data and logs soon!

Cheers,

@asphator
Author

asphator commented Feb 7, 2023

Hi @ioguix
Thx for the reply. I'm not in a rush :)

@asphator
Author

Hi there
The issue still occurs from time to time on some clusters; I couldn't figure out why so far :/

@asphator
Author

asphator commented Nov 6, 2024

Hi @ioguix
We've faced this issue again on another cluster (where it had never occurred before).
Patching is up to date, load was normal, there was no network or storage issue, and the logs are still unclear on why the monitor timed out; it just comes out of nowhere.
My gut tells me it's likely memory related, but the dashboards don't show anything either: no bottleneck, no threshold reached, shared_buffers normal, ... I'm running out of ideas to understand and prevent this issue.

@ioguix
Member

ioguix commented Nov 6, 2024

Hi @asphator,

So your logs show both the walsender and the walreceiver shutting down on timeout (60s) some time after the monitor action itself timed out (10s). Something seems fishy on the network side…

My gut tells me either the monitor action timed out because of some TCP/network issue slowing down some cluster-related communications, or the query on pg_stat_replication was waiting on an LWLock because the walsender was somehow stuck. Both might be hard to detect/prove and I don't really buy them myself yet.
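
If you manage to catch it live next time, a quick look at wait events might help confirm or rule out the LWLock theory. This is just a standard pg_stat_activity query (the wait_event columns exist since 9.6), adapt it as needed:

  # run on the primary while the monitor action hangs
  psql postgres -c "SELECT pid, state, wait_event_type, wait_event, query FROM pg_stat_activity WHERE wait_event IS NOT NULL;"
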

Maybe you could:

  • raise the monitor action timeout
  • lower the walsender/walreceiver timeouts below the monitor action timeout
  • adjust your kernel TCP keepalive settings to be more aggressive and detect dead TCP sessions before your various timeouts expire
  • or a mix of all of the above (see the sketch below)
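
For example, something along these lines for the first and third points (the values are purely illustrative, double-check the pcs syntax against your pcs version; the walsender/walreceiver change would be the same kind of ALTER SYSTEM tweak you sketched above):

  # raise the Master monitor timeout (example value)
  pcs resource update pgsqld op monitor interval=15s role=Master timeout=60s

  # more aggressive kernel TCP keepalives, to detect dead TCP sessions earlier (example values)
  sysctl -w net.ipv4.tcp_keepalive_time=60
  sysctl -w net.ipv4.tcp_keepalive_intvl=5
  sysctl -w net.ipv4.tcp_keepalive_probes=3
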
