Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix resuming state without inline verifier #185

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kolbitsch-lastline
Copy link

If the inline-verifier is not enabled (as is the case for various
production uses of ghostferry), its binlog position can grow stale. In
some cases it points to such an old position that resuming from it
fails (if the source has already deleted such old replication logs).

This commit fixes this by relying solely on the binlog writer resume
position if the inline verifier is not enabled.

We still fail if the inline verifier is enabled and the position is
stale, but there is nothing one can do about that. If verification is
enabled, one must ensure that it's able to keep up with migration.

This fixes #184

If the inline-verifier is not enabled (as is the case for various
production uses of ghostferry), its binlog position can grow stale. In
some cases it points to such an old position that resuming from it
fails (if the source has already deleted such old replication logs).

This commit fixes this by relying solely on the binlog writer resume
position if the inline verifier is not enabled.

We still fail if the inline verifier *is* enabled and the position is
stale, but there is nothing one can do about that. If verification is
enabled, one must ensure that it's able to keep up with migration.

This fixes Shopify#184

Change-Id: I24cef00bb78c266705107b2b8b4008171186940d
@shuhaowu
Copy link
Contributor

Thanks for the PR! This feels like a problem somewhere else tho, since we have a guard for this condition in both of these places:

if s.LastStoredBinlogPositionForInlineVerifier == nilPosition {
return s.LastWrittenBinlogPosition
}
and

ghostferry/ferry.go

Lines 539 to 541 in 50a7760

if f.inlineVerifier != nil {
f.StateTracker.UpdateLastResumableSourceBinlogPositionForInlineVerifier(sourcePos)
}

@kolbitsch-lastline
Copy link
Author

thanks for the quick feedback!

yeah, I also considered that approach, but I felt that adding more checks in multiple places isn't the cleanest solution.

On start, you can't really fix the problem (here

ghostferry/ferry.go

Lines 539 to 541 in 50a7760

if f.inlineVerifier != nil {
f.StateTracker.UpdateLastResumableSourceBinlogPositionForInlineVerifier(sourcePos)
}
), as we need to continuously update that counter, but if the inline verifier is not running, that code doesn't even get event notifications.

I considered moving it into MinSourceBinlogPosition but that felt like propagating a flag into a method that doesn't need to care about it (hence I moved the if into the call).

We could also register a "dummy event listener" that simply sets the last-seen binlog position for the inline verifier. This way one could resume the verifier at any point. Not sure if that's cleaner than the simple if though.

@hkdsun
Copy link
Contributor

hkdsun commented May 15, 2020

From what I'm reading, if you start the ferry with the inline verifier disabled, we never update the state tracker's lastStoredBinlogPositionForInlineVerifier.. The only possible way to get here is then to start a ferry with the verifier enabled, interrupt, and then resume with it disabled.

Is there failing test or repro steps you can demonstrate?

@hkdsun
Copy link
Contributor

hkdsun commented May 15, 2020

I think that is what happened... based on #184 .

How about instead of this change, we clear the state from the SerializedState?

That is, if across invocations you disable the verifier, we clear all its previous state.

And this won't reach into the MinSourceBinlogPosition implementation.

Something like:

	if f.StateToResumeFrom != nil {
		if f.inlineVerifier == nil {
			f.StateToResumeFrom.LastStoredBinlogPositionForInlineVerifier = siddontangmysql.Position{}
		}
		sourcePos, err = f.BinlogStreamer.ConnectBinlogStreamerToMysqlFrom(f.StateToResumeFrom.MinSourceBinlogPosition())
	} else {

@kolbitsch-lastline
Copy link
Author

kolbitsch-lastline commented May 15, 2020

How about instead of this change, we clear the state from the SerializedState?

That is, if across invocations you disable the verifier, we clear all its previous state.

that a good suggestion, but there is a reason I did not follow this approach: conceptually, it is very nice to have the possibility to temporarily disable the verifier. If we were to clear the state, that means we cannot pick back up where we had left off.

Furthermore, in many cases, there is actually no need to clear the state, because resuming later is perfectly safe. The problem arises only once the source DB can no-longer serve binlog data starting at the resume point. Sure, sooner or later this will always happen, but if you don't disable the verifier for (typically) weeks, there is no harm.

Thus, the current implementation seems more generic and safer, but I don't feel strongly about this at all and am happy to change how this happens

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Resuming ghostferry may fail due to inactive verifier
3 participants