pgsqld monitor timed out on master when sync slave crashed #218
Comments
Hi @asphator, This is definitely something I'm interested in investigating. Unfortunately, I'm currently short on time... I hope I'll be able to parse all your data and logs soon! Cheers,
Hi @ioguix
Hi there
Hi @ioguix
Hi @asphator, So your logs show that both the walsender and the walreceiver shut down on timeout (60s) some time after the monitoring action timed out as well (10s). So something seems fishy on the network side… My gut tells me either the monitoring action timed out because of some TCP/network issues slowing down some cluster-related communications, or the query on Maybe you could:
Hi there
Yesterday, I had an issue I cannot fully explain (it occurred only once before, about 2 years ago; at the time, I already couldn't figure out the root cause).
The closest related topic I could find is issue #107, but it has no satisfying explanation.
Our issue
Someone from another SRE team made a mistake and stopped the wrong server through iDRAC (an electrical power-off), followed by an OS reinstallation. Unfortunately, it was the sync slave of one of our clusters.
The setup should be resilient to the loss of a slave, but here the pgsqld monitor timed out and thus the master process was restarted by pcs.
It restarted immediately and successfully, with the 2nd slave (the async one) quickly up to date.
BUT I'm unable to understand or reproduce this behavior. During our tests and benchmarks, the expected behavior occurs: sync switches to the 3rd node and the master lives on (after a short freeze).
Of course, it happened during the day, with some load / locks on the databases. We didn't reach the maximum number of connections, but active connections were hanging during the minute before the restart.
Our setup
Infrastructure
Postgres
As you can guess, node 3 is 2 countries away, while nodes 1 and 2 are both around Paris.
If you need more, just let me know which ones.
Cluster
Nominal state
Timeline
Some logs
Syslog master (srv11014)
Syslog slave async (srv31036)
Syslog slave Sync (srv21014)
I can provide more; let me know which ones you need.
My questions
Could you explain why the master's pgsqld monitor timed out?
Or could you suggest what to monitor or trace more deeply (if possible)?
We have a bunch of dashboards; maybe some useful information is in there, but I haven't correlated anything yet. Feel free to ask: if we have it, I can share :)
I'm considering tuning the WAL sender/receiver timeouts down to 9s (1s lower than the timeout of the pcs resource). Do you think it could "safely" avoid such behavior, or would I face some side effects/risks I may have missed? In this use case, I would expect the sync to be on srv31036 around 13:46:16.
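To make the timing argument concrete, here is a minimal sketch in Python of the comparison I have in mind. It is only a simplified model, not how PostgreSQL or Pacemaker actually schedule things: it assumes the standby vanishes at t=0, that the walsender only notices after its timeout, and that a monitor action stays blocked until then. The `Timeouts` class and the `walsender_gives_up_first` function are hypothetical names made up for illustration; the 60s and 10s values are the ones seen in this incident.

```python
from dataclasses import dataclass


@dataclass
class Timeouts:
    wal_sender_timeout_s: float  # walsender timeout, in seconds
    monitor_timeout_s: float     # timeout of the pgsqld monitor action, in seconds


def walsender_gives_up_first(t: Timeouts, monitor_start_delay_s: float = 0.0) -> bool:
    """Return True if the walsender would drop the dead standby before the
    monitor action reaches its own timeout.

    Assumption (not verified): the standby disappears at t=0 and a monitor
    action started at monitor_start_delay_s stays blocked until the
    walsender gives up.
    """
    walsender_drop_at = t.wal_sender_timeout_s
    monitor_deadline = monitor_start_delay_s + t.monitor_timeout_s
    return walsender_drop_at < monitor_deadline


# Values during the incident: 60s walsender timeout, 10s monitor timeout.
current = Timeouts(wal_sender_timeout_s=60, monitor_timeout_s=10)
# Proposed tuning: walsender timeout 1s below the monitor timeout.
proposed = Timeouts(wal_sender_timeout_s=9, monitor_timeout_s=10)

print(walsender_gives_up_first(current))   # False
print(walsender_gives_up_first(proposed))  # True
```

Under these assumptions the 9s value only wins when the monitor action starts right at the moment of the failure or later; the model says nothing about side effects such as spurious replication disconnects on a slow link.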
Kind Regards
Arpad