Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop killing deleted connections in a replicaSet #236

Merged
merged 3 commits into from
May 26, 2021

Conversation

dmsnell
Copy link
Contributor

@dmsnell dmsnell commented Feb 13, 2021

Status

  • Update version number and rebase against any latest updates in master
  • Verify that changes from killing to shutting down don't break anything

Resolves #226

When stopping connections to "deleted" servers use stop(Pid, normal)
instead of stop(Pid, kill), since killing crashes and will cascade
up to bring down the supervisor.

This behavior is reproducible when connecting to a replica set with a connection
string that doesn't match the hostname and port that Mongo sees for itself, such
as when running in a Docker environment.

As soon as mc_monitor:check/2 reports the set of hosts for the replica set they
will come back different than what was used to connect, e.g. "mongo.localhost:27017"
vs. "localhost:27020" (if port-mapping is used with Docker), and so the driver will
think that the "localhost:27020" server was deleted from the cluster and then kill
the connections.

By not killing the connections the driver continues to operate stably.

❓ why are these connections being killed in the first place? would we really
want to crash the entire driver, connection pool, and database supervisor just because
the topology changed? (or in this case, was reporting differently than how we connected
to it in the first place)

❓ other responses in mc_topology:update_topology_state/2 also kill their connections.
should those be updated to stop(Pid, normal) as well? is there a need to communicate
these shutdowns to the calling application? could that be done without a crash? (the crashes
not only take down the supervisor but also flood the system crash logs, as noted in #224 and #225)

@dmsnell
Copy link
Contributor Author

dmsnell commented Feb 17, 2021

cc: @comtihon for your thoughts

@comtihon
Copy link
Owner

hi,

why are these connections being killed in the first place? would we really
want to crash the entire driver, connection pool, and database supervisor just because
the topology changed?

no, it was a mistake.

other responses in mc_topology:update_topology_state/2 also kill their connections.

which other responses do you mean?

@dmsnell
Copy link
Contributor Author

dmsnell commented Feb 23, 2021

@comtihon sorry for my delayed response - I was pulled into some urgent issues.

most of the updates in update_topology_state kill their connection processes. some of these are like replicaSetNoPrimary and unknown but it looks like there are eight cases where this is happening.

my guess is that none would present themselves as circumstances to kill the connection (vs exit(Pid, normal)) but my knowledge of mongo is lacking here to know if those are truly exceptional never-should-exist circumstances or those are somewhat common this-happens-from-time-to-time-in-normal-circumstances.

@comtihon
Copy link
Owner

hi @dmsnell , thank you for the response.
I also don't remember why it was implemented this way.
Can you please change it to normal everywhere?
(and bump the latest version, please make sure to rebase on the latest master, as another PR was merged with version update)

@comtihon
Copy link
Owner

and version please?

@dmsnell
Copy link
Contributor Author

dmsnell commented Mar 15, 2021

right, sorry @comtihon - I updated this partially to try and fix an issue we saw in our app, getting into a crash-loop on boot but only when adding both nodes in a replicaSet cluster in the config (with only the primary server, or at least, with only one node listed in the config it was fine)

I will @-mention you when I have the rest done. I've added a few todo items in the status to try and better communicate where it's at.

Resolves comtihon#226

When stopping connections to "deleted" servers use `stop(Pid, normal)`
instead of `stop(Pid, kill)`, since killing crashes and will cascade
up to bring down the supervisor.
In this patch I've bumped the version number as a minor update.
It was unclear whether this was a major or minor update because while
it changes the behavior of the topology state when server configs change
it doesn't do anything but prevent crashing the app.

For applications which depend on the crashing behavior (hopefully there
are none) then this is a major change.
@dmsnell dmsnell force-pushed the fix/stop-killing-deleted-databases branch from 9e451e6 to 46c34c1 Compare May 24, 2021 22:37
@dmsnell
Copy link
Contributor Author

dmsnell commented May 24, 2021

@comtihon I think this is as finished as it's going to get right now. I was hoping to have more thorough production testing but we've had this code running for a while and haven't seen the crashes that it fixes. I still have some conceptual questions about the topology state but maybe we can get this one in now and come back later if we need.

Version bump is in there and all connections now stop normally instead of being killed.

@comtihon comtihon merged commit 4d70b9d into comtihon:master May 26, 2021
@comtihon
Copy link
Owner

Thank you very much for the contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issue connecting to ReplicaSet
2 participants