Stop of rabbit app within start_rmq_server_app (OCF `rabbitmq-server-ha`) #1833

f-schie · 2023-01-11T10:33:57Z

Hi,

in the OCF file rabbitmq-server-ha, I don't understand why after a successful start, the function stop_rmq_server_app is called.

resource-agents/heartbeat/rabbitmq-server-ha, line 1402 in 50b6cd7:

    stop_rmq_server_app

As seen in the snippet below, why would I want to stop the RMQ server app when it has just been started successfully as master of the cluster:

    if [ $rc -eq $OCF_SUCCESS ] ; then
        # rabbitmq-server started successfuly as master of cluster
        master_score $MIN_MASTER_SCORE
        stop_rmq_server_app
        rc=$?
        if [ $rc -ne 0 ] ; then
            ocf_log err "${LH} RMQ-server app can't be stopped. Beam will be killed."
            kill_rmq_and_remove_pid
            unblock_client_access "${LH}"
            return $OCF_ERR_GENERIC
        fi

Clearly I am missing something, could someone please explain why it is done this way? @bogdando maybe can you help me out here?

We are using the OCF rabbitmq-server-ha agent within a 3-node Pacemaker cluster and experience slow starts and a somewhat strange RabbitMQ master election (a newly booted node tears down the active master and starts its own promotion...)


bogdando · 2023-01-11T12:30:46Z

Firstly, thank you for using this agent and taking care of its health!

In the repository from which this OCF agent originates (now in openstack-archive), there was a related change and the corresponding Gerrit change. Some related LP bugs are linked in the commit message for more context.

For the record: setting master_score 1 (minimal positive master-score for this node) means that the application is stopping on a non-master node (all of them). The master takes master_score 1000 normally, and the node which never should be promoted takes the score 0.
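To make the three score levels concrete, here is a minimal sketch of the mapping described above. The function name `score_for_state`, the state labels, and the constant names other than `MIN_MASTER_SCORE` are assumptions for illustration only; the sketch only prints the value and does not show how the agent actually publishes the score to Pacemaker.

```shell
# Hypothetical constants mirroring the scores described in this comment.
MIN_MASTER_SCORE=1       # minimal positive score: app stopped, node still promotable
BEST_MASTER_SCORE=1000   # score the running master normally takes
NO_MASTER_SCORE=0        # node that should never be promoted

# Hypothetical helper: print the master score for a given node state.
score_for_state() {
    case "$1" in
        master)    echo "$BEST_MASTER_SCORE" ;;
        candidate) echo "$MIN_MASTER_SCORE" ;;
        excluded)  echo "$NO_MASTER_SCORE" ;;
        *)         echo "unknown state: $1" >&2; return 1 ;;
    esac
}
```

For example, `score_for_state candidate` prints the minimal positive score that the quoted snippet sets right before calling `stop_rmq_server_app`.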

So, as the follow-up fix clarifies, we stop the app only to test whether it can be started "for real". There were some corner cases where the application reports that it has started but in fact is not functioning properly. The linked LP bug explains that in detail. FWIW, we want to make sure that the app can be stopped without errors after we have started it. If it cannot, the Mnesia DB is cleaned up, so that the next time Pacemaker runs monitor or processes other events, it should start without problems (most likely!)
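The idea can be sketched roughly as follows. This is a simplified illustration, not the agent's code: `start_app`, `stop_app`, `cleanup_after_failed_stop`, and the `STOP_RC` variable are hypothetical stubs, and the real agent also kills the Beam process and unblocks client access when the stop fails, as the snippet quoted in the question shows.

```shell
# Hypothetical stubs standing in for the real start/stop/cleanup steps.
start_app() { return 0; }                    # pretend the app reports a successful start
stop_app()  { return "${STOP_RC:-0}"; }      # STOP_RC simulates a failing stop
cleanup_after_failed_stop() { echo "cleanup"; }

# A start only counts as good if the app can also be stopped cleanly.
verify_start_by_stopping() {
    start_app || return 1
    if ! stop_app; then
        # The start was not "for real": clean up so that a later start can succeed.
        cleanup_after_failed_stop
        return 1
    fi
    echo "verified"
    return 0
}
```

Running `verify_start_by_stopping` with the default stubs takes the success path; setting `STOP_RC=1` beforehand exercises the cleanup path.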

bogdando · 2023-01-11T12:48:03Z

By the way, there is some automation around customized Jepsen tests, which I used to run from time to time in a fork of the rabbitmq-server repo, via GitHub Actions.

It used to always reassemble the cluster upon network partitions caused by the testing framework, and allowed the test to complete. I no longer maintain that automation and fork, as we moved the script from rabbitmq-server repo to this new home. Having that jepsen-CI around here could be a good idea...

bogdando · 2023-01-11T12:57:39Z

> newly booted node tears down active master and starts its own promotion

This could be a valid issue, and it would also explain the suboptimal testing results in Jepsen (many pending messages).

f-schie · 2023-01-11T16:22:31Z

Thanks for the quick reply!
So if I am understanding correctly, it is OK (and partially expected) to have a scenario like this:

1. Initiate a restart of msRabbitMQ via Pacemaker.
2. RabbitMQ is stopped via action_stop() (which calls stop_server_process()).
3. After successfully stopping everything, action_start() invokes:
   - start_rmq_server_app(), which starts the RMQ-server app via try_to_start_rmq_app(), as /usr/sbin/rabbitmqctl start_app is called within it
   - even if that is successful, stop_app is executed to verify the correct stop/start behavior
4. If stop_app fails, the Mnesia DB is reset and the whole process is repeated.
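To check my understanding of the "reset and repeat" part, here is a rough sketch of the loop as I picture it. Everything in it is an assumption on my side: `MAX_TRIES`, `try_start_and_verify`, `reset_mnesia`, and `start_with_retries` are made-up names, and the stub simply fails on the first attempt to simulate a failing stop_app.

```shell
MAX_TRIES=3   # hypothetical retry limit
attempt=0

# Hypothetical stub: the start/stop verification fails on the first attempt only.
try_start_and_verify() {
    attempt=$((attempt + 1))
    [ "$attempt" -ge 2 ]
}

# Hypothetical stub for the Mnesia reset step.
reset_mnesia() { echo "reset mnesia (attempt $attempt)"; }

# Retry the start until the verification passes or the limit is reached.
start_with_retries() {
    tries=0
    while [ "$tries" -lt "$MAX_TRIES" ]; do
        tries=$((tries + 1))
        if try_start_and_verify; then
            echo "started after $tries tries"
            return 0
        fi
        reset_mnesia
    done
    return 1
}
```

With this stub, the first pass triggers one reset and the second pass succeeds.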

Resetting the mnesia DB means losing all durable exchanges/queues and the data in those, doesn't it? If there is a cluster outage because of power loss, it means all data that has not been processed is lost upon recovery?

Regarding my other scenario:

> newly booted node tears down active master and starts its own promotion

I need to investigate this further. The remaining node receives a notify followed by a demote action by the DC as soon as the "old" master reboots.

Thanks for the link to the automation repo - I'll check this out!

bogdando · 2023-01-11T16:52:30Z

Resetting the Mnesia DB is the standard handling for unrecoverable failures of start, stop, join, and the like. When using HA queues, or Raft queues (which also require durable queues), the data loss can perhaps be minimized.

f-schie changed the title ~~Stop of rabbit app within start_rmq_server_app~~ Stop of rabbit app within start_rmq_server_app (OCF rabbitmq-server-ha) Jan 11, 2023

f-schie closed this as completed Jan 11, 2023

f-schie reopened this Jan 11, 2023
