Trie: Critical DB-Consisteny Bug #3264

TimDaub · 2024-02-01T01:05:02Z

here's what I just wrote down:

I have reported this to Gabriel before but not sure if y'all have looked into this. My suspicion is that somehow the ethereumjs library throws or anyways doesn't finish a write somehow. Or that maybe I kill the ethereumjs execution and then this leads to inconsistency. But in anycase, IMO, I should never manage to get any irrecoverable state from ethereumjs/trie. Happy to answer any questions

TimDaub · 2024-02-01T01:08:51Z

btw I made a backup of that corrupted state in the database.
bootstrap.zip

you can basically run it with these options and it should help you reproduce it

gabrocheleau · 2024-02-02T20:51:05Z

Thanks for reporting this and coming up with a backup and reproduction scenario.

I'll see if I can reproduce this locally and get back. Workload is high currently with ongoing developments, but this is something we need to get to the bottom of as it could perhaps reveal more general issues that go beyond the scope of your implementation.

gabrocheleau · 2024-02-17T22:49:42Z

So to recap and confirm my understanding, here's what I'm getting from the issue:

A MPT is used on your server for data storage & integrity checks.
At some point, the server may crash, and the MPT enters an inconsistent state, where the _root hash has been updated, but not the node corresponding to that root.

Obviously, this should not be happening. However it is hard for me to get a good grasp on the actual issue since it happens in the context of external usage that I have limited familiarity and understanding of.

From the issue you've linked, I see these lines in the logs:

2024-02-01T00:00:58: 2024-02-01T00:00:58.967Z @attestate/kiwistand Stored message with index "65badc1d8b88b49217fda6abdb764eb69cf239d8d8e625eb0737f28762efa39bd9c8c727"
2024-02-01T00:00:58: 2024-02-01T00:00:58.967Z @attestate/kiwistand New root: "e49c93db1dd3c722

And before that, I see this error:

024-02-01T00:00:58: Message: put: Didn't add message to database
2024-02-01T00:00:58: Stack Trace: Error: add: Didn't find root message of comment
2024-02-01T00:00:58:     at Module.add (file:///root/kiwistand/src/store.mjs:361:13)
2024-02-01T00:00:58:     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
2024-02-01T00:00:58:     at async put (file:///root/kiwistand/src/sync.mjs:332:7)
2024-02-01T00:00:58:     at async file:///root/kiwistand/src/sync.mjs:444:7

I'm not sure what exactly the expected order of operations, but might this be a race condition where the root is set independantly (e.g. through the trie.root() method) without the put operation actually updating the trie, therefore resulting in a root that is inconsistent with the underlying trie? If the initial issue is server/networking-related (this seems to be the case from the logs, but I might be wrong), would it work to just reset the trie root to the previous one in case of a failure in the put operation, and retry the full operation? Our Statemanager package would likely provide some logic that can serve as an inspiration for that.

If I'm misunderstanding the issue, would you be able to provide a relatively minimal reproductible example (that can be run natively in the context of our repository). Then I'd be able to troubleshoot further, and generate test cases.

For months now the reconciliation algo has been plagued by bugs surrounding the ethereumjs/trie library. We've opened many tickets on Github: - #146 - #131 - ethereumjs/ethereumjs-monorepo#3264 - ethereumjs/ethereumjs-monorepo#3645 The pattern of the problem was always that somehow that `trie.root()` couldn't be found using `trie.checkRoot`, which seemed almost like a contradiction, especially when doing `await trie.checkRoot(trie.root())`. We had initially introduced the checkpointing of the trie because of some rather theoretical problem regarding what would happen if during the reconciliation the trie updates and, at the same times, sends level comparisons to a peer. So to use checkpointing for us was primarily used to implement atomicity when storing data. We wanted to just store the remote trie's leaves in batches as to make sure not to interrupt the algorithm to compare the trie's levels. At the same time, the insertion of new leaves into such a trie is costly as a big part of its hashes have to be recomputed to arrive at a new root. However, I think what has happened with our implementation of the sync.put method is that the checkpointing led to the trie writes not being processed sequentially which also lead to all sorts of problems in the reconciliation. The reconciliation is purposefully built in a way where it first synchronizes old leaves and only then new leaves. While a working reconciliation doesn't have any issues with storing comments, a fundamentally asynchronous reconciliation will attempt to store comments where the original upvote hasn't been made yet, leading to the message not being processed initially. Another big problem ended up being that the ethereumjs/trie library isn't mature with regards to handling the application shutting down, and so a lot of the above mentioned issues actually describe the ethereumjs/trie library reaching a non-recoverable state. Funnily enough, however, all it took to fix all of the above problems was to remove all notions of checkpointing and commits. While it does make the reconciliation algorithm MUCH slower (because it is now synchronous), it also made it much more reliable and almost free of errors during interaction.

holgerd77 · 2024-09-20T08:04:55Z

No follow-up, will close. Feel free to re-open if you have got further information!

TimDaub · 2024-09-20T11:59:29Z

No follow-up, will close.

Some related follow up here
#3645

TimDaub added the package: mpt label Feb 1, 2024

holgerd77 changed the title ~~critical ethereumtrie bug~~ Trie: Critical DB-Consisteny Bug Feb 15, 2024

holgerd77 added type: bug type: discussion / question eval: needs context eval: needs validation prio: P1 critical labels Feb 15, 2024

gabrocheleau self-assigned this Feb 17, 2024

holgerd77 closed this as completed Sep 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Trie: Critical DB-Consisteny Bug #3264

Trie: Critical DB-Consisteny Bug #3264

TimDaub commented Feb 1, 2024

TimDaub commented Feb 1, 2024 •

edited

Loading

gabrocheleau commented Feb 2, 2024

gabrocheleau commented Feb 17, 2024

holgerd77 commented Sep 20, 2024

TimDaub commented Sep 20, 2024 •

edited

Loading

Trie: Critical DB-Consisteny Bug #3264

Trie: Critical DB-Consisteny Bug #3264

Comments

TimDaub commented Feb 1, 2024

TimDaub commented Feb 1, 2024 • edited Loading

gabrocheleau commented Feb 2, 2024

gabrocheleau commented Feb 17, 2024

holgerd77 commented Sep 20, 2024

TimDaub commented Sep 20, 2024 • edited Loading

TimDaub commented Feb 1, 2024 •

edited

Loading

TimDaub commented Sep 20, 2024 •

edited

Loading