Stop killing deleted connections in a replicaSet #236

dmsnell · 2021-02-13T06:56:04Z

Status

- Update version number and rebase against any latest updates in master
- Verify that changes from killing to shutting down don't break anything

Resolves #226

When stopping connections to "deleted" servers use `stop(Pid, normal)`
instead of `stop(Pid, kill)`, since killing crashes the connection process
and that crash cascades up to bring down the supervisor.
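
To illustrate the difference, here is a minimal sketch. It is not the driver's code: the module and function names are hypothetical, and it assumes the stop helper ends up delivering an exit signal via the standard `exit/2` BIF. The "owner" stands in for whatever process is linked to a connection.

```erlang
-module(exit_signal_sketch).
-export([owner_outcome/1]).

%% Hypothetical stand-in for a connection process. It traps exits, so an
%% external exit(Pid, normal) arrives as a message and it can finish cleanly.
worker(Report) ->
    process_flag(trap_exit, true),
    Report ! {ready, self()},
    receive
        {'EXIT', _From, Reason} -> exit(Reason)
    end.

%% Hypothetical stand-in for a process linked to the connection.
%% It does NOT trap exits, like a plain linked process.
owner(Report) ->
    _Worker = spawn_link(fun() -> worker(Report) end),
    receive stop -> ok end.

%% Sends an exit signal with the given reason to the worker and reports
%% whether the linked owner went down with it.
owner_outcome(Reason) ->
    Self = self(),
    {Owner, Ref} = spawn_monitor(fun() -> owner(Self) end),
    Worker = receive {ready, W} -> W end,
    exit(Worker, Reason),
    receive
        {'DOWN', Ref, process, Owner, Why} -> {owner_died, Why}
    after 200 ->
        %% Owner is still alive; clean up the demo processes.
        erlang:demonitor(Ref, [flush]),
        exit(Owner, kill),
        owner_survived
    end.
```

With `kill`, the worker dies with reason `killed` and the linked owner dies with it; with `normal`, the worker exits normally and the owner keeps running. Note that an exit signal with reason `normal` is ignored by a process that does not trap exits, so whether a real connection process actually stops depends on how it handles that signal.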

This behavior is reproducible when connecting to a replica set with a connection
string that doesn't match the hostname and port that Mongo sees for itself, such
as when running in a Docker environment.

As soon as `mc_monitor:check/2` reports the set of hosts for the replica set, they
come back different from what was used to connect, e.g. "mongo.localhost:27017"
vs. "localhost:27020" (if port-mapping is used with Docker), so the driver
thinks that the "localhost:27020" server was deleted from the cluster and kills
its connections.
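
The mismatch can be pictured as a simple list difference. This is only a sketch of the idea, with a hypothetical module and function name; the driver's real comparison lives in its topology code and may differ.

```erlang
-module(topology_diff_sketch).
-export([deleted_hosts/2]).

%% Returns the hosts we connected with that the replica set does not
%% report for itself. Hosts are plain "host:port" strings here, which is
%% an assumption made for the example.
deleted_hosts(Known, Reported) ->
    [Host || Host <- Known, not lists:member(Host, Reported)].
```

For the Docker case above, `topology_diff_sketch:deleted_hosts(["localhost:27020"], ["mongo.localhost:27017"])` returns `["localhost:27020"]`: the address we actually connected through looks as if it had been removed from the cluster.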

By not killing the connections the driver continues to operate stably.

❓ why are these connections being killed in the first place? would we really
want to crash the entire driver, connection pool, and database supervisor just because
the topology changed? (or in this case, was reporting differently than how we connected
to it in the first place)

❓ other responses in mc_topology:update_topology_state/2 also kill their connections.
should those be updated to stop(Pid, normal) as well? is there a need to communicate
these shutdowns to the calling application? could that be done without a crash? (the crashes
not only take down the supervisor but also flood the system crash logs, as noted in #224 and #225)
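
On the last question, one possible direction (an assumption on my part, not something the driver does today) is for an interested process to monitor a connection rather than rely on a crash. The sketch below uses only the standard `erlang:monitor/2` BIF; the module name and the stand-in connection process are hypothetical.

```erlang
-module(monitor_notify_sketch).
-export([watch_shutdown/0]).

%% Spawns a stand-in "connection", monitors it, asks it to stop, and
%% returns the notification the watcher receives.
watch_shutdown() ->
    Conn = spawn(fun() -> receive stop -> exit(normal) end end),
    Ref = erlang:monitor(process, Conn),
    Conn ! stop,
    receive
        {'DOWN', Ref, process, Conn, Reason} -> {connection_down, Reason}
    after 1000 ->
        timeout
    end.
```

Calling `monitor_notify_sketch:watch_shutdown()` returns `{connection_down, normal}`: the watcher learns about the shutdown as an ordinary message and nothing crashes.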

dmsnell · 2021-02-17T20:13:48Z

cc: @comtihon for your thoughts

comtihon · 2021-02-18T09:14:29Z

hi,

> why are these connections being killed in the first place? would we really
> want to crash the entire driver, connection pool, and database supervisor just because
> the topology changed?

no, it was a mistake.

> other responses in mc_topology:update_topology_state/2 also kill their connections.

which other responses do you mean?

dmsnell · 2021-02-23T18:44:23Z

@comtihon sorry for my delayed response - I was pulled into some urgent issues.

most of the updates in update_topology_state kill their connection processes. some of these are like replicaSetNoPrimary and unknown but it looks like there are eight cases where this is happening.

my guess is that none would present themselves as circumstances to kill the connection (vs exit(Pid, normal)) but my knowledge of mongo is lacking here to know if those are truly exceptional never-should-exist circumstances or those are somewhat common this-happens-from-time-to-time-in-normal-circumstances.

comtihon · 2021-02-25T08:30:43Z

hi @dmsnell , thank you for the response.
I also don't remember why it was implemented this way.
Can you please change it to normal everywhere?
(and bump the latest version, please make sure to rebase on the latest master, as another PR was merged with version update)

comtihon · 2021-03-15T07:52:57Z

and version please?

dmsnell · 2021-03-15T20:46:10Z

right, sorry @comtihon - I updated this partially to try and fix an issue we saw in our app, getting into a crash-loop on boot but only when adding both nodes in a replicaSet cluster in the config (with only the primary server, or at least, with only one node listed in the config it was fine)

I will @-mention you when I have the rest done. I've added a few todo items in the status to try and better communicate where it's at.

Resolves comtihon#226

When stopping connections to "deleted" servers use `stop(Pid, normal)` instead of `stop(Pid, kill)`, since killing crashes and will cascade up to bring down the supervisor.

In this patch I've bumped the version number as a minor update. It was unclear whether this was a major or minor update: it changes the behavior of the topology state when server configs change, but all it does is prevent the app from crashing. For applications which depend on the crashing behavior (hopefully there are none) this is a major change.

dmsnell · 2021-05-24T22:39:11Z

@comtihon I think this is as finished as it's going to get right now. I was hoping to have more thorough production testing but we've had this code running for a while and haven't seen the crashes that it fixes. I still have some conceptual questions about the topology state but maybe we can get this one in now and come back later if we need.

Version bump is in there and all connections now stop normally instead of being killed.

comtihon · 2021-05-26T06:51:48Z

Thank you very much for the contribution!

dmsnell added 3 commits May 24, 2021 15:35

Stop killing deleted connections in a replicaSet

519980c

Resolves comtihon#226 When stopping connections to "deleted" servers use `stop(Pid, normal)` instead of `stop(Pid, kill)`, since killing crashes and will cascade up to bring down the supervisor.

Stop killall updated replicaset members

0dc587c

dmsnell force-pushed the fix/stop-killing-deleted-databases branch from 9e451e6 to 46c34c1 Compare May 24, 2021 22:37

comtihon merged commit 4d70b9d into comtihon:master May 26, 2021

This was referenced May 26, 2021

Generic server terminating #224

Open

Failure to start if list of replicaSet seeds needs to be resolved to hostnames #160

Open
