Skip to content

Latest commit

 

History

History
212 lines (152 loc) · 8.29 KB

STATE-COMPATIBILITY.md

File metadata and controls

212 lines (152 loc) · 8.29 KB

State-Compatibility

It is critical for the patch releases to be state-machine compatible with prior releases in the same minor version. For example, v3.2.1 must be state-machine compatible with v3.2.0.

This is to ensure determinism, i.e., given the same input, the nodes will always produce the same output.

State-incompatibility is allowed for major upgrades because all nodes in either the consumer or provider networks perform it at the same time. Therefore, after the upgrade, the nodes continue functioning in a deterministic way.

Scope

The state-machine scope includes the following areas:

  • All ICS messages including:
    • Every msg's ValidateBasic method
    • Every msg's MsgServer method
      • Net gas usage, in all execution paths
      • Error result returned
      • State changes (namely every store write)
  • AnteHandlers in "DeliverTx" mode
  • All BeginBlock/EndBlock logic

The following are NOT in the state-machine scope:

  • Events
  • Queries that are not whitelisted
  • CLI interfaces

Validating State-Compatibility

CometBFT ensures state compatibility by validating a number of hashes that can be found here.

AppHash and LastResultsHash are the common sources of problems stemming from our work. To avoid these problems, let's now examine how these hashes work.

AppHash

Note: The following explanation is simplified for clarity.

An app hash is a hash of hashes of every store's Merkle root that is returned by ABCI's Commit() from Cosmos-SDK to CometBFT. Cosmos-SDK takes an app hash of the application state, and propagates it to CometBFT which, in turn, compares it to the app hash of the rest of the network. Then, CometBFT ensures that the app hash of the local node matches the app hash of the network.

LastResultsHash

LastResultsHash is the root hash of all results from the transactions in the block returned by the ABCI's DeliverTx.

The LastResultsHash in CometBFT v0.34.29 contains:

  1. Tx GasWanted

  2. Tx GasUsed

GasUsed being Merkelized means that we cannot freely reorder methods that consume gas. We should also be careful of modifying any validation logic since changing the locations where we error or pass might affect transaction gas usage.

There are plans to remove this field from being Merkelized in a subsequent CometBFT release, at which point we will have more flexibility in reordering operations / erroring.

  1. Tx response Data

The Data field includes the proto marshalled Tx response. Therefore, we cannot change these in patch releases.

  1. Tx response Code

This is an error code that is returned by the transaction flow. In the case of success, it is 0. On a general error, it is 1. Additionally, each module defines its custom error codes. For example, x/provider currently has these error codes defined.

As a result, it is important to avoid changing custom error codes or change the semantics of what is valid logic in transaction flows.

Note that all of the above stem from DeliverTx execution path, which handles:

  • AnteHandler's marked as deliver tx
  • msg.ValidateBasic
  • execution of a message from the message server

The DeliverTx return back to the CometBFT is defined here.

Major Sources of State-incompatibility

Creating Additional State

By erroneously creating database entries that exist in Version A but not in Version B, we can cause the app hash to differ across nodes running these versions in the network. Therefore, this must be avoided.

Changing Proto Field Definitions

For example, if we change a field that gets persisted to the database, the app hash will differ across nodes running these versions in the network.

Additionally, this affects LastResultsHash because it contains a Data field that is a marshaled proto message.

Returning Different Errors Given Same Input

// Version A
func (sk Keeper) validateAmount(ctx context.Context, amount math.Int) error {
    if amount.IsNegative() {
        return sdkerrors.Wrap(sdkerrors.ErrInvalidRequest, "amount must be positive or zero")
    }
    return nil
}
// Version B
func (sk Keeper) validateAmount(ctx context.Context, amount math.Int) error {
    if amount.IsNegative() || amount.IsZero() {
        return sdkerrors.Wrap(sdkerrors.ErrInvalidRequest, "amount must be positive")
    }
    return nil
}

Note that now an amount of 0 can be valid in "Version A", but invalid in "Version B". Therefore, if some nodes are running "Version A" and others are running "Version B", the final app hash might not be deterministic.

Additionally, a different error message does not matter because it is not included in any hash. However, an error code sdkerrors.ErrInvalidRequest does. It translates to the Code field in the LastResultsHash and participates in its validation.

Variability in Gas Usage

For transaction flows (or any other flow that consumes gas), it is important that the gas usage is deterministic.

Currently, gas usage is being Merklized in the state. As a result, reordering functions becomes risky.

Suppose my gas limit is 2000 and 1600 is used up before entering someInternalMethod. Consider the following:

func someInternalMethod(ctx sdk.Context) {
  object1 := readOnlyFunction1(ctx) # consumes 1000 gas
  object2 := readOnlyFunction2(ctx) # consumes 500 gas
  doStuff(ctx, object1, object2)
}
  • It will run out of gas with gasUsed = 2600 where 2600 getting merkelized into the tx results.
func someInternalMethod(ctx sdk.Context) {
  object2 := readOnlyFunction2(ctx) # consumes 500 gas
  object1 := readOnlyFunction1(ctx) # consumes 1000 gas
  doStuff(ctx, object1, object2)
}
  • It will run out of gas with gasUsed = 2100 where 2100 is getting merkelized into the tx results.

Therefore, we introduced a state-incompatibility by merklezing diverging gas usage.

Secondary Limitations To Keep In Mind

Network Requests to External Services

It is critical to avoid performing network requests to external services since it is common for services to be unavailable or rate-limit.

Imagine a service that returns exchange rates when clients query its HTTP endpoint. This service might experience downtime or be restricted in some geographical areas.

As a result, nodes may get diverging responses where some get successful responses while others errors, leading to state breakage.

Randomness

Randomness cannot be used in the state machine, as the state machine must be deterministic.

Note: Iteration order over maps is non-deterministic, so to be deterministic you must gather the keys, and sort them all prior to iterating over all values.

Parallelism and Shared State

Threads and Goroutines might preempt differently in different hardware. Therefore, they should be avoided for the sake of determinism. Additionally, it is hard to predict when the multi-threaded state can be updated.

Hardware Errors

This is out of the developer's control but is mentioned for completeness.