EFM Recovery Service Event and Transaction #440

Merged


Contributor

@kc1116 kc1116 commented Jul 30, 2024

This PR updates the FlowEpoch smart contract to support recovering the network while in Epoch Fallback Mode (EFM). It adds a new service event, EpochRecover, which contains the metadata for the recovery epoch. This metadata is generated out of band using the bootstrap utility util epoch efm-recover-tx-args (onflow/flow-go#5576) and submitted to the contract with the recover_epoch.cdc transaction. The FlowEpoch contract will end the current epoch, start the recovery epoch, and store the metadata for the recovery epoch in storage. This metadata will then be emitted to the network during the next heartbeat interval.
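
As a rough illustration of the flow described above, here is a Go sketch (not the Cadence implementation). The type, field, and method names are hypothetical and are not taken from FlowEpoch.cdc; the sketch also assumes the stored metadata is emitted once and then cleared, which the description does not state explicitly.

```go
package main

import "fmt"

// RecoverEpochMetadata is a hypothetical, heavily simplified stand-in for the
// recovery epoch data carried by the EpochRecover service event.
type RecoverEpochMetadata struct {
	Counter   uint64
	FirstView uint64
	FinalView uint64
}

// EpochContract models only the state relevant to this sketch.
type EpochContract struct {
	CurrentCounter  uint64
	PendingRecovery *RecoverEpochMetadata // held in storage until the next heartbeat
}

// Recover ends the current epoch, starts the recovery epoch, and stores the
// metadata so that it can be emitted later.
func (c *EpochContract) Recover(m RecoverEpochMetadata) {
	c.CurrentCounter = m.Counter
	c.PendingRecovery = &m
}

// Heartbeat returns the stored recovery metadata, if any, and clears it.
// Clearing after one emission is an assumption of this sketch.
func (c *EpochContract) Heartbeat() (RecoverEpochMetadata, bool) {
	if c.PendingRecovery == nil {
		return RecoverEpochMetadata{}, false
	}
	m := *c.PendingRecovery
	c.PendingRecovery = nil
	return m, true
}

func main() {
	c := &EpochContract{CurrentCounter: 10}
	c.Recover(RecoverEpochMetadata{Counter: 11, FirstView: 1000, FinalView: 1999})
	if ev, ok := c.Heartbeat(); ok {
		fmt.Printf("EpochRecover emitted for epoch %d\n", ev.Counter)
	}
	_, again := c.Heartbeat()
	fmt.Println("emitted again:", again)
}
```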

Reopening original PR: #420

Member

@jordanschalm jordanschalm left a comment

Copying over the main comment from the previous review: #420 (comment)

The second conditional case of the recover_epoch transaction (when unsafeAllowOverwrite is false) doesn't use the recoveryEpochCounter value at all. But if we go down that code path and FlowEpoch.currentEpochCounter != recoveryEpochCounter, we know the recovery process will fail.

So I think we should use recoveryEpochCounter in the second codepath as well. We can explicitly check that FlowEpoch.currentEpochCounter == recoveryEpochCounter, for example as a precondition, and panic if this doesn't hold.
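
A minimal sketch of the suggested check, written in Go rather than Cadence. The function and error names are hypothetical; in the contract this would presumably be a precondition that panics rather than a returned error.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrCounterMismatch is a hypothetical sentinel error for this sketch.
var ErrCounterMismatch = errors.New("recovery epoch counter does not match current epoch counter")

// checkRecoveryCounter mirrors the proposed precondition for the code path
// where unsafeAllowOverwrite is false: recovery may only proceed when
// currentEpochCounter == recoveryEpochCounter.
func checkRecoveryCounter(currentEpochCounter, recoveryEpochCounter uint64) error {
	if currentEpochCounter != recoveryEpochCounter {
		return fmt.Errorf("%w: current=%d, recovery=%d",
			ErrCounterMismatch, currentEpochCounter, recoveryEpochCounter)
	}
	return nil
}

func main() {
	fmt.Println(checkRecoveryCounter(7, 7)) // counters match
	fmt.Println(checkRecoveryCounter(7, 8)) // counters differ
}
```

Failing early here means a mismatched counter aborts the transaction instead of starting a recovery that is already known to fail.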

Member

@jordanschalm jordanschalm left a comment

This is coming along nicely!

My main suggestion in this review is to expand the test coverage to cover more edge cases (suggestions enumerated here). The existing tests are quite verbose, so I think it would be worthwhile to invest time in factoring out some of the common test logic when adding test cases. After we get Josh's input on the implementation changes, I'd be OK with implementing additional test coverage in a separate PR. If you'd like to do that, let me know.

Member

@AlexHentschel AlexHentschel left a comment

Thank you for the nice work. Appreciate the multitude of smaller refactorings, where you have moved auxiliary code into little service methods -- that certainly improves readability of the code.

I have added various suggestions for extending the documentation. However, given my very limited knowledge of Cadence and the epoch smart contracts, I don't feel sufficiently confident in my abilities to spot potential problems/errors to approve this PR.

⚠️ There is one possibly significant challenge [update: it's not a big risk; see Jordan's comment below] that I noticed:

  • the FlowEpoch smart contract offers two entry points for recovery:
    1. recoverNewEpoch which requires that the counter for the recovery epoch matches the smart contract's current epoch
    2. recoverCurrentEpoch, which enforces that the counter for the recovery epoch is one bigger than the smart contract's current epoch
  • I think this places strong limitations on the scenarios we can successfully recover from (specifically the time frame in which a recovery must be successful). Let's unravel this a bit:
    • so initially we assume that the Protocol State and the Epoch Smart Contract are on the happy path: both counters are (largely) in sync
    • then there is a problem and the Protocol State goes into EFM. That means for the running network, where the protocol state is the source of truth that determines operation, the network remains on the epoch counter N
    • However, while the protocol state stays on epoch N (extending it until successful recovery), the smart contract can continue to progress through its speculative epochs.
    • I think it is very likely that failures will occur relatively close to the desired epoch switchover, because the Epoch Setup phase only ends a few hours before the target transition and that is where problems typically occur. Let's say it's 3 hours before the target transition and the protocol state goes into EFM and stays at epoch N.
    • The smart contract continues its work and enters epoch N+1. Everyone is stressed because the network is in EFM, some people might be OOO, the engineers are doing the best they can. The engineers trying to recover the epoch lifecycle know that they have to specify the next epoch: They query the smart contract, which tells them the system is currently in epoch N+1. So the engineers specify epoch N+2 and call recoverNewEpoch. The smart contract is happy, emits a recovery event for epoch N+2, and enters epoch N+2 ... but the protocol state rejects the recovery because it is still in epoch N and expects epoch N+1 to be specified. And then we are screwed: the protocol state must receive a recovery epoch N+1 but the smart contract is already at N+2, so it only accepts recovery data for epochs with counter ≥ N+2! ☠️
    • different scenario: due to typos, stress and unfamiliarity with the recovery process, the first two calls to recoverNewEpoch each emit an event (each increasing the counter), and both events are rejected. We end up in a similar scenario: the smart contract's epoch counter has already progressed beyond the expected value for the dynamic protocol state.
    • Other scenario: too many partner nodes are offline and we would like to get them back online before attempting an epoch recovery ... reaching out and helping the partners might take some time. The network is running fine (just staying in its current EFM epoch). We decide to leave the system in EFM for more than a week (presumably nothing bad will happen), but forget to call epochReset ... so after a week the smart contract is now in epoch N+2 while the Protocol State is still in N.

Essentially our current smart contract implementation makes the very limiting assumption that the Protocol State's Epoch counter can be at most one behind the smart contract. Otherwise, we have no means for recovery.
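
To make the constraint concrete, here is a small Go sketch. It takes the description of the two entry points above at face value (the contract accepts a recovery counter equal to its current counter or one larger) and assumes the protocol state in epoch N only accepts a recovery epoch N+1. The function name is hypothetical.

```go
package main

import "fmt"

// recoverable reports whether there is a recovery epoch counter that both
// the protocol state and the smart contract would accept.
//   - The protocol state in epoch N expects the recovery epoch to be N+1.
//   - The contract (as described above) accepts a counter equal to its
//     current epoch, or one bigger than its current epoch.
func recoverable(protocolEpoch, contractEpoch uint64) bool {
	want := protocolEpoch + 1
	return want == contractEpoch || want == contractEpoch+1
}

func main() {
	const n = 10 // protocol state is stuck in EFM epoch N
	for k := uint64(0); k <= 3; k++ {
		// The contract believes the system is in epoch N+k.
		fmt.Printf("k=%d recoverable=%v\n", k, recoverable(n, n+k))
	}
}
```

Under these assumptions, recovery is only possible while the contract is at most one epoch ahead of the protocol state.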
Let's keep in mind that we are implementing a disaster prevention mechanism here: it's very rare, so no one really has much experience with it; occurrences of disasters cannot be planned for; people are stressed and engineers with the deep background might be unavailable; and the first EFM might happen in a year, when we have already forgotten some of the critical but subtle limitations.

Hence, I am strongly of the opinion that this process should be as fault-proof as possible:

  • multiple/many failed recovery attempts should be possible
  • the system should provide ample time for successful recovery (certainly more than a week)
  • it should be nearly impossible for failed recovery attempts to break anything (no matter how broken the inputs are)

I think we are pretty close but have two main hurdles:

  1. We should prepare for the scenario where the protocol state is in EFM epoch N but the smart contract believes the system is in epoch N+k for any integer k. That would be something to solve as part of this PR (or a subsequent smart contract PR).

  2. ideally, the fallback state machine guarantees that a successful RecoverEpoch event always is a valid epoch configuration. The recovery parameters might be manually set, so the risk of human error should be mitigated. What is missing is checking:

    • that the cluster QCs are valid QCs for each collector cluster
    • DKG committee has sufficient intersection with the consensus committee to allow for live consensus

    This is out of scope of this PR.
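
A sketch of what the second check could look like, in Go. The required overlap is left as a parameter because the comment above does not specify a threshold; the function names and the use of string node IDs are assumptions made for illustration only.

```go
package main

import "fmt"

// intersectionSize counts the distinct node IDs that appear in both the DKG
// committee and the consensus committee.
func intersectionSize(dkg, consensus []string) int {
	inConsensus := make(map[string]struct{}, len(consensus))
	for _, id := range consensus {
		inConsensus[id] = struct{}{}
	}
	counted := make(map[string]struct{})
	for _, id := range dkg {
		if _, ok := inConsensus[id]; ok {
			counted[id] = struct{}{} // duplicates in dkg are counted once
		}
	}
	return len(counted)
}

// hasSufficientOverlap is a hypothetical sanity check. minOverlap is supplied
// by the caller, since the threshold needed for live consensus is not stated
// in this thread.
func hasSufficientOverlap(dkg, consensus []string, minOverlap int) bool {
	return intersectionSize(dkg, consensus) >= minOverlap
}

func main() {
	dkg := []string{"a", "b", "c"}
	consensus := []string{"b", "c", "d"}
	fmt.Println(intersectionSize(dkg, consensus))
	fmt.Println(hasSufficientOverlap(dkg, consensus, 3))
}
```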

As usual, we should be weighing how much engineering time that actually would take to implement. Nevertheless, it deeply worries me that we have a bunch of subtle footguns in our implementation, in that we might irreparably break mainnet in case we violate one of the several subtle constraints (either by human error, or even worse by not acting for only a week).

Also cc @durkmurder @jordanschalm for visibility, comments and thoughts.

Member

jordanschalm commented Aug 9, 2024

Responding to Alex's comment here 👇

> We should prepare for the scenario where the protocol state is in EFM epoch N but the smart contract believes the system is in epoch N+k for any integer k

You outlined a few scenarios in your comment, but each of them relies on the smart contract continuing to transition through speculative epochs without the Protocol State following suit.

In practice the smart contract transition process provides a strong guarantee that $k \in \{0, 1\}$ before any recover_epoch attempt happens.

Smart Contract Transition Logic

  • The smart contract transitions to the next epoch when (1) it is executed in the context of a block with view >= currentEpoch.FinalView and (2) it is in the EpochCommitted phase.
  • The smart contract enters the EpochCommitted phase after the DKG and cluster QC vote generation are successfully completed.
  • So, in order to transition epochs, the smart contract requires Protocol participation in the corresponding DKG and cluster QC voting processes.
  • In EFM, Protocol participants don't participate in the DKG or cluster QC voting.
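
The transition condition in the first bullet can be sketched in Go as follows. The phase constants and function name are hypothetical simplifications, not the contract's actual identifiers.

```go
package main

import "fmt"

// Phase is a simplified, hypothetical model of the contract's epoch phases.
type Phase int

const (
	PhaseStaking Phase = iota
	PhaseSetup
	PhaseCommitted
)

// shouldTransition encodes the two conditions listed above: the block view
// has reached the current epoch's final view, and the contract is in the
// EpochCommitted phase.
func shouldTransition(blockView, finalView uint64, phase Phase) bool {
	return blockView >= finalView && phase == PhaseCommitted
}

func main() {
	// In EFM nobody participates in the DKG or cluster QC voting, so the
	// contract never reaches the committed phase and cannot transition,
	// no matter how far the view advances.
	fmt.Println(shouldTransition(5000, 1000, PhaseSetup))
	// Happy path: committed phase and final view reached.
	fmt.Println(shouldTransition(1000, 1000, PhaseCommitted))
}
```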

Outstanding Problems

Invoking recoverNewEpoch with inconsistent inputs

> due to typos, stress and unfamiliarity with the recovery process the first two calls to recoverNewEpoch emit an event

This is a very good point and why we added the recoveryEpochCounter as an argument. But, if that parameter is not set properly, then we can end up in a situation where $k>1$.

Like with reset_epoch, we have an automated tool which reads the current Protocol State and writes the recover_epoch transaction arguments. The intention of this is to minimize the impact of incorrect manual inputs, but of course incorrect inputs are always possible.

Let's consider the alternative. If we want to be able to recover from cases where $k>1$, we need to implement support in the recovery process for deleting or overwriting potentially multiple historical epoch entries from the smart contract state before injecting the recovery epoch. This increases implementation complexity and introduces additional surface area for human error ("Oops! I deleted the last 10 epochs"). I'm not convinced this is better.

Extra input validation

> 1. Ideally, the fallback state machine guarantees that a successful RecoverEpoch event always is a valid epoch configuration. [...]

Agree with this, just adding that any configuration validation we add to the FallbackStateMachine should also be added in the utility generating the recover_epoch transaction arguments (if possible). That way problems are caught earlier. The cluster QC validation is already done in GenerateClusterRootQC, but we can add the DKG committee size sanity check.

Member

@jordanschalm jordanschalm left a comment

Really nice expansion of the test coverage -- thank you.

Summary of feedback:

  • If I'm understanding correctly, we don't have a test case that executes recovery during the staking phase -- I think we should add this before merging
  • I added some questions about the last test case (we're doing two recoveries back-to-back and I'm not sure why)

Member

@joshuahannan joshuahannan left a comment

Looks pretty good! Just have some questions and small comments

@jordanschalm jordanschalm merged commit 872ffe7 into feature/efm-recovery Oct 25, 2024
2 checks passed