Fix resuming state without inline verifier #185
base: main
Conversation
If the inline-verifier is not enabled (as is the case for various production uses of ghostferry), its binlog position can grow stale. In some cases it points to such an old position that resuming from it fails (if the source has already deleted such old replication logs). This commit fixes this by relying solely on the binlog writer resume position if the inline verifier is not enabled.

We still fail if the inline verifier *is* enabled and the position is stale, but there is nothing one can do about that. If verification is enabled, one must ensure that it's able to keep up with migration.

This fixes Shopify#184

Change-Id: I24cef00bb78c266705107b2b8b4008171186940d
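A minimal, self-contained Go sketch of the resume-position selection the commit describes. All names here (`BinlogPosition`, `SerializedState`, `resumePosition`, and the field names) are illustrative assumptions, not ghostferry's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// BinlogPosition is a simplified stand-in for replication coordinates
// (binlog file name plus offset within that file).
type BinlogPosition struct {
	File string
	Pos  uint32
}

// SerializedState mirrors the idea that the saved state carries two
// positions: one for the binlog writer and one for the inline verifier.
// The field names are hypothetical.
type SerializedState struct {
	LastWrittenBinlogPosition           BinlogPosition
	LastStoredBinlogPositionForVerifier BinlogPosition
}

// resumePosition picks the position to resume streaming from. When the
// inline verifier is disabled, its stored position may be arbitrarily
// stale, so it is ignored entirely in favor of the writer's position.
func resumePosition(state SerializedState, verifierEnabled bool) (BinlogPosition, error) {
	if !verifierEnabled {
		return state.LastWrittenBinlogPosition, nil
	}
	v := state.LastStoredBinlogPositionForVerifier
	w := state.LastWrittenBinlogPosition
	if v.File == "" {
		return BinlogPosition{}, errors.New("verifier enabled but no verifier position stored")
	}
	// With the verifier enabled, resume from the older of the two
	// positions so no unverified events are skipped.
	if v.File < w.File || (v.File == w.File && v.Pos < w.Pos) {
		return v, nil
	}
	return w, nil
}

func main() {
	state := SerializedState{
		LastWrittenBinlogPosition:           BinlogPosition{"mysql-bin.000120", 4096},
		LastStoredBinlogPositionForVerifier: BinlogPosition{"mysql-bin.000003", 150},
	}
	// With the verifier disabled, the stale verifier position is ignored.
	pos, _ := resumePosition(state, false)
	fmt.Println(pos.File)
}
```

With the verifier enabled, the same state would instead resume from the (much older) verifier position, which is exactly the case that can fail if the source has purged those logs.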
Thanks for the PR! This feels like a problem somewhere else, though, since we already have a guard for this condition in both of these places: Lines 51 to 53 in 04dda85, and Lines 539 to 541 in 50a7760.
Thanks for the quick feedback! Yeah, I also considered that approach, but I felt that adding more checks in multiple places isn't the cleanest solution. On start, you can't really fix the problem (see Lines 539 to 541 in 50a7760), and I considered moving it there.

We could also register a "dummy event listener" that simply sets the last-seen binlog position for the inline verifier. This way one could resume the verifier at any point. Not sure if that's cleaner than the simple fix in this PR, though.
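The "dummy event listener" idea sketched above might look roughly like this; everything here (`BinlogEvent`, `positionRecorder`, `OnBinlogEvent`) is a hypothetical illustration, not ghostferry's real listener interface:

```go
package main

import "fmt"

// BinlogEvent is a simplified stand-in for a replication event, carrying
// only the position it was read from.
type BinlogEvent struct {
	File string
	Pos  uint32
}

// positionRecorder is a no-op listener: it performs no verification and
// only tracks the last-seen binlog position, so the inline verifier's
// resume point stays fresh even while verification itself is disabled.
type positionRecorder struct {
	last BinlogEvent
}

func (r *positionRecorder) OnBinlogEvent(ev BinlogEvent) error {
	r.last = ev
	return nil
}

func main() {
	r := &positionRecorder{}
	for _, ev := range []BinlogEvent{
		{"mysql-bin.000119", 900},
		{"mysql-bin.000120", 4096},
	} {
		_ = r.OnBinlogEvent(ev)
	}
	fmt.Println(r.last.File)
}
```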
From what I'm reading, if you start the ferry with the inline verifier disabled, we never update the state tracker's binlog position for the inline verifier. Are there failing tests or repro steps you can demonstrate?
I think that is what happened, based on #184. How about, instead of this change, we clear the state from the SerializedState? That is, if across invocations you disable the verifier, we clear all its previous state. And this avoids reaching into the resume logic. Something like:
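A rough Go sketch of that suggestion; `SerializedState` and its field names are hypothetical stand-ins for the real serialized state:

```go
package main

import "fmt"

type BinlogPosition struct {
	File string
	Pos  uint32
}

// SerializedState is a simplified stand-in for ghostferry's serialized
// resume state; the verifier-related field name is hypothetical.
type SerializedState struct {
	LastWrittenBinlogPosition           BinlogPosition
	LastStoredBinlogPositionForVerifier BinlogPosition
}

// clearVerifierStateIfDisabled drops the verifier's saved position when
// the inline verifier is disabled for this run, so a later resume can
// never consult a stale verifier position left by a previous invocation.
func clearVerifierStateIfDisabled(state *SerializedState, verifierEnabled bool) {
	if !verifierEnabled {
		state.LastStoredBinlogPositionForVerifier = BinlogPosition{}
	}
}

func main() {
	state := SerializedState{
		LastWrittenBinlogPosition:           BinlogPosition{"mysql-bin.000120", 4096},
		LastStoredBinlogPositionForVerifier: BinlogPosition{"mysql-bin.000003", 150},
	}
	clearVerifierStateIfDisabled(&state, false)
	fmt.Println(state.LastStoredBinlogPositionForVerifier.File == "")
}
```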
That's a good suggestion, but there is a reason I did not follow this approach: conceptually, it is very nice to have the possibility to temporarily disable the verifier. If we were to clear the state, we could not pick back up where we had left off. Furthermore, in many cases there is actually no need to clear the state, because resuming later is perfectly safe. The problem arises only once the source DB can no longer serve binlog data starting at the resume point. Sure, sooner or later this will always happen, but if you don't disable the verifier for (typically) weeks, there is no harm. Thus, the current implementation seems more generic and safer. That said, I don't feel strongly about this at all and am happy to change how this happens.