redis-ha-4.27.0 - split brain #283
Comments
+1
I've had this too. I've found that when we added a descheduler to the stack (https://github.com/kubernetes-sigs/descheduler) to balance nodes automatically, this kind of issue would take down the Redis service frequently. Could the master allocation be done with Kubernetes lease locks? https://kubernetes.io/docs/concepts/architecture/leases/
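For context, a Lease is just a small object that one holder claims and keeps renewing; a rough sketch of what a master lock could look like (names are illustrative, not something the chart creates today):

```yaml
# Hypothetical Lease a leader elector could claim; names here are illustrative
# and not part of the redis-ha chart today.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: redis-ha-master              # assumed lock name for "who is master"
  namespace: default
spec:
  holderIdentity: redis-ha-server-0  # pod currently holding the lock
  leaseDurationSeconds: 15           # claim expires if not renewed in time
  leaseTransitions: 3                # how many times leadership has changed
```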
@tschirmer I'm trying to work out why this would happen unless the podManagementPolicy is set to Parallel. Is this happening in either of your cases, @tschirmer? Because in theory, on first rollout, the first pod should start up and become master well before -1/-2 start.
@DandyDeveloper
Haven't set it to Parallel. I suspect it's something like: the pod, when evicted, isn't completing trigger-failover-if-master.sh. We are running it with Sentinel, which might add some complexity here. I haven't debugged it yet. So far we're getting a load of issues with the liveness probe not containing the SENTINELAUTH env from the secret, even though it's clearly defined in the spec; a restart of the pod works around it. It's happening very frequently though, so I'm wondering if there needs to be a grace period defined on startup and shutdown to prevent both of these things from happening.
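As a rough illustration of the kind of grace periods I mean (values and probe details are guesses, not something I've tested against this chart):

```yaml
# Sketch only: lengthen the shutdown window so the preStop failover script can
# finish, and delay the first liveness check so a starting pod isn't killed early.
spec:
  terminationGracePeriodSeconds: 60          # assumed value; Kubernetes default is 30
  containers:
    - name: redis
      livenessProbe:
        exec:
          command: ["sh", "-c", "redis-cli -a \"$AUTH\" ping"]  # AUTH env name assumed
        initialDelaySeconds: 30              # startup grace period (assumed)
        periodSeconds: 15
```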
I think having separate StatefulSets for the Redis and Sentinel pods would make this chart more stable and manageable.
I like the idea of separate StatefulSets; I've been thinking of doing that and making a PR. I suspect this comes from preStop hooks not firing and completing successfully. trigger-failover-if-master.sh occasionally doesn't run as expected. When we had the descheduler running it was ~2 min between turning each pod off and on, and we found that every now and again the script would fail. The rate of failure is low, so it's unlikely to occur unless you're hammering it (we haven't had an issue with the HA cluster since we turned off the descheduler).
I wanted to make a PR too, but there are a lot of configs that this change would need to propagate through.
I found that there were a couple things wrong with my setup:
The permissions were the killer, because nothing was failing over on shutdown. I'm halfway through writing a leader elector in Go for this, based on k8s Leases; I've got it claiming the lease already. I'm not sure it's totally necessary after we've solved these other issues, though.
Specifically, in the StatefulSet, it was the volume definitions that needed changing.
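Roughly, the fix was about making the mounted scripts executable; something along these lines (volume and ConfigMap names are assumed rather than copied from the chart):

```yaml
# Sketch of the permissions fix: mount the scripts ConfigMap with an executable
# file mode so trigger-failover-if-master.sh can run as a preStop hook.
# The volume and ConfigMap names are assumed, not copied from the chart.
volumes:
  - name: config
    configMap:
      name: redis-ha-configmap
      defaultMode: 0755          # default is 0644, which is not executable
```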
Also found that the preStop hook /readonly-config/..data/trigger-failover-if-master.sh requires SENTINELAUTH, but it's not defined in the env for the redis container.
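For illustration, wiring SENTINELAUTH into the redis container from a secret would look roughly like this (the secret name and key are assumptions):

```yaml
# Sketch: make SENTINELAUTH available to the redis container so the preStop
# failover script can authenticate against Sentinel. Secret name/key are assumed.
containers:
  - name: redis
    env:
      - name: SENTINELAUTH
        valueFrom:
          secretKeyRef:
            name: redis-ha-secret
            key: sentinel-password
```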
^ I'd modified the above so I could get some debug data, along with a related change in the StatefulSet.
Found that the preStops would consistently fail.
Found that the Sentinel container had shut down before the command could be executed on localhost, so it kept reporting a failed failover. Changed the Sentinel preStop to add a 10-second delay to keep it alive while this happened, and it seems to work every time now.
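Roughly what I mean, as a sketch (the exact hook wording in the chart may differ):

```yaml
# Sketch: keep the Sentinel container alive for a few seconds on shutdown so the
# redis container's preStop (trigger-failover-if-master.sh) can still reach it.
containers:
  - name: sentinel
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 10"]   # 10s delay before Sentinel exits
```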
While this "might" work, it "may not" be consistent, suggest taking a look at my solution instead here: #207 (comment) |
Describe the bug
I deployed the chart with default values. During operation we hit a condition where redis-0 and redis-2 were replicas of redis-1, and redis-1 was a replica of redis-0. The split-brain-fix container wasn't able to fix the problem.
172.20.75.109 - redis-0
172.20.181.236 - redis-1
172.20.198.17 - redis-2
redis-0:
redis-1 (sentinel tries to restart it):
sentinel-1 (leader)
split-brain-fix-1
split-brain-fix-0
To Reproduce
I tried node/pod deletion and redis-cli replicaof, but could not reproduce this bug.
Expected behavior
The split-brain-fix container should fix even this rare case.
Additional context
The script's logic was broken by Sentinel's inability to fail over. Maybe the script should have an additional condition to check the role of the potential default master. I would be very appreciative of any help with this. Please let me know if you need any additional logs/checks.