Stop of rabbit app within start_rmq_server_app (OCF rabbitmq-server-ha) #1833
Comments
Firstly, thank you for using this agent and taking care of its health! In the repository from which this OCF agent originates (now in openstack-archive), there had been a related change and the corresponding gerrit change. There are some related LP bugs linked in the commit message for more context. For the record: setting So, as the follow-up fix clarifies, we stop the app only to test whether it can be started "for real". There had been some corner cases where the application reports as started but is in fact not functioning properly. The linked LP bugs explain that in detail. FWIW, we want to make sure that the app can be stopped w/o errors after we have started it. And if it cannot, the mnesia DB will be cleaned up, so that the next time pacemaker runs monitor or processes other events, it should start w/o problems (most likely!)
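To make the reasoning above concrete, here is a minimal sketch of the "start, then test-stop, then clean up on failure" idea as described in this comment. It is not the agent's actual code: `start_app`, `stop_app`, `reset_db` and the `SIM_*` variables are hypothetical stand-ins, so the control flow can be read and run without RabbitMQ or pacemaker.

```shell
#!/bin/sh
# Hypothetical stand-ins for the real start/stop/reset helpers.
# SIM_START_OK / SIM_STOP_OK simulate success (1) or failure (0).
start_app() { [ "${SIM_START_OK:-1}" = 1 ]; }
stop_app()  { [ "${SIM_STOP_OK:-1}" = 1 ]; }
reset_db()  { echo "mnesia reset"; }

# Start the app, then stop it again as a health check.
# A start that "succeeds" but leaves an app that cannot be stopped
# cleanly is treated as a failure and the DB is cleaned up.
start_with_stop_check() {
    if ! start_app; then
        reset_db
        return 1
    fi
    if ! stop_app; then
        reset_db
        return 1
    fi
    return 0
}
```

In this sketch a clean start/stop cycle returns 0 and prints nothing, while a failing stop prints `mnesia reset` and returns non-zero, which mirrors the "clean up so the next attempt can start" behaviour described above.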
By the way, there is some automation around customized Jepsen tests, which I used to run from time to time in a fork of the rabbitmq-server repo, via GitHub Actions. It used to always reassemble the cluster upon network partitions caused by the testing framework, and allowed the test to complete. I no longer maintain that automation and fork, as we moved the script from the rabbitmq-server repo to this new home. Having that Jepsen CI around here could be a good idea...
This could be a valid issue, and would also explain suboptimal testing results in Jepsen (many pending messages).
Thanks for the quick reply!
Resetting the mnesia DB means losing all durable exchanges/queues and the data in those, doesn't it? If there is a cluster outage because of power loss, it means all data that has not been processed is lost upon recovery? Regarding my other scenario:
I need to investigate this further. The remaining node receives a notify followed by a demote action by the DC as soon as the "old" master reboots. Thanks for the link to the automation repo - I'll check this out!
Resetting the mnesia DB is the standard handling for unrecoverable start/stop/join failures and the like. When using HA queues, or raft queues (which also require durable queues), the data loss can perhaps be minimized.
Hi,
in the OCF file rabbitmq-server-ha, I don't understand why, after a successful start, the function stop_rmq_server_app is called.
resource-agents/heartbeat/rabbitmq-server-ha
Line 1402 in 50b6cd7
As seen in the snippet below, why would I want to stop the rmq server app when it was just started successfully as master of the cluster:
Clearly I am missing something, could someone please explain why it is done this way? @bogdando maybe you can help me out here?
We are using OCF rabbitmq-server-ha within a pacemaker cluster of 3 nodes and experience slow starts and a somewhat strange master election of the RabbitMQ master (newly booted node tears down active master and starts its own promotion...)