Persistent version of btree #10
Comments
While I think a persistent version might be nice, I'm not sure if it'd fit On Wed, Sep 21, 2016 at 10:24 PM, keep94 [email protected] wrote:
|
Sounds good to me. By the way, this is a great library. |
In this post, I will discuss design goals and my high level plan for implementing: Design goals:
Non goals:
High level design: I plan to reuse the node struct, along with its code, for nodes in a persistent btree. The difference is that the persistent btree will employ copy-on-write instead of modifying existing nodes in place. That is, whenever I need to change a node, I will make a copy of it first and then mutate the copy in place.
To prevent the creation of gratuitous intermediate objects, I will pass around a set of node pointers that have already been copied for write, and I will copy a node only when it is not already in that set. The lifespan of this set is one group of batch changes. At the beginning of a batch change I allocate an empty set as a local variable. I pass the set as a parameter to all the functions while the batch changes are happening. When the changes are done, the set goes away.
The need to pass the copy-on-write set around for changes on persistent btrees complicates code reuse, but I get around this by making the operations on btree nodes higher order functions. For each existing node function in btree, I make a helper function that does the same thing. The helper function takes the exact same parameters as the existing function, plus the writable node set, plus shims for making the recursive calls. The persistent versions have to make recursive calls to themselves and update child pointers, while the existing conventional functions have to make recursive calls to themselves that modify child pointers in place. Each existing node function will simply delegate to its helper function, passing nil for the copy-on-write set along with mutable style shims for the recursive calls. In addition, each existing node function will get a persistent version that takes the same parameters plus the copy-on-write set. The persistent version will return the same values, plus a node pointer.
The persistent version of each function will first ask the copy-on-write set for a writable copy of its receiver. Then it will call the helper function on that writable copy, passing it the copy-on-write set along with persistent style recursive call shims that call the persistent version and update child pointers. Finally, the persistent version will return the values that the helper function returned plus the writable copy of the receiver. I will be able to apply this work in a cookie cutter fashion to all the code of the *node struct.
Sample work: You can see the start of this work on the first three methods of the node struct here: keep94@8e32a6d
Conclusion: The copy-on-write set allows me to reuse the existing node objects with minimal changes to implement a persistent btree. While the copy-on-write set will cause additional work for GC during mutations of a persistent btree, I believe it is a fine trade-off for maintaining the elegance and correctness of the existing btree code. Moreover, using a copy-on-write set when mutating persistent btrees will not affect the operation of existing btrees. The only changes to the code path of the existing btree code will be extra function calls for the helper functions and the shims for doing the recursion. Function calls are cheap these days, so I believe it will have minimal impact on performance for the existing code. Please let me know whether or not this work is right for this google/btree repo. Thank you in advance. |
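The copy-on-write set described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the linked commit: the `node` struct is a simplified stand-in for the real btree node, and the names `copyOnWriteSet`, `writable`, and `appendToLeftmost` are hypothetical.

```go
package main

import "fmt"

// node is a simplified, hypothetical stand-in for the btree node struct.
type node struct {
	items    []int
	children []*node
}

// copyOnWriteSet (hypothetical name) records the nodes that have already
// been copied during one batch of changes.
type copyOnWriteSet map[*node]bool

// writable returns n itself if n was already copied in this batch;
// otherwise it returns a fresh copy of n and records that copy.
func (c copyOnWriteSet) writable(n *node) *node {
	if c[n] {
		return n
	}
	out := &node{
		items:    append([]int(nil), n.items...),
		children: append([]*node(nil), n.children...),
	}
	c[out] = true
	return out
}

// appendToLeftmost is a toy mutation, not a real B-tree insert: it appends v
// to the leftmost leaf. Every node on the path is copied at most once per
// batch. It returns the node that should replace n in its parent.
func appendToLeftmost(n *node, v int, c copyOnWriteSet) *node {
	w := c.writable(n)
	if len(w.children) == 0 {
		w.items = append(w.items, v)
		return w
	}
	w.children[0] = appendToLeftmost(w.children[0], v, c)
	return w
}

func main() {
	leaf := &node{items: []int{1}}
	right := &node{items: []int{9}}
	root := &node{items: []int{5}, children: []*node{leaf, right}}

	cow := copyOnWriteSet{} // lives for one batch of changes only
	r1 := appendToLeftmost(root, 2, cow)
	r2 := appendToLeftmost(r1, 3, cow) // same batch: nodes are reused, not copied again

	fmt.Println(root.children[0].items)            // the original leaf is untouched
	fmt.Println(r2.children[0].items)              // the new version sees both appends
	fmt.Println(r1 == r2, r2.children[1] == right) // no second copy; right subtree shared
}
```

Within one batch, the second mutation finds the root and leaf copies already in the set and mutates them in place, which is the saving on intermediate objects that the proposal is after.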
I have implemented what I proposed above and have written tests. However, instead of using shims, I decided to let a nil copy-on-write set accept requests for writable versions of nodes, in which case it just returns the node unchanged. In the ephemeral case, asking for a writable version of a node is the same as assigning a node pointer to itself. This small change allowed me to leave the call sites of the recursive calls unchanged, obviating the need for shims and greatly simplifying the changes I made. Without the shims, the size of the call stack is about the same as it was before.
I ran the benchmarks and compared with the master branch. Although I took great effort to disturb the existing code as little as possible, the benchmarks for insert and delete run 3 to 4 percent slower with my changes than before. That is about 675 ns/op on my small mac mini vs 650 ns/op. I am running go 1.1.2. I attribute the slight slowdown to the overhead of asking for writable copies of nodes. Although this is essentially a no-op in the ephemeral case, the function call to do it does cost something, and the call is made as often as other mutating calls.
The 3 to 4% slowdown may be a fair trade-off considering that I am reusing most of the existing code. Code reuse is a good thing, as it makes the code easier to maintain. Let me know what you think. |
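A minimal sketch of the nil copy-on-write set idea, assuming a map-based set; the `node` type and the names `copyOnWriteSet`, `writable`, and `add` are hypothetical and are not taken from the actual change. In Go a method can be called on a nil map receiver, so one code path can serve both the ephemeral and the persistent case.

```go
package main

import "fmt"

// node is a simplified, hypothetical stand-in for the btree node struct.
type node struct{ items []int }

// copyOnWriteSet is a hypothetical name; a nil value means "ephemeral mode".
type copyOnWriteSet map[*node]bool

// writable returns n unchanged when the set is nil (ephemeral case) or when
// n was already copied in this batch; otherwise it returns a recorded copy.
func (c copyOnWriteSet) writable(n *node) *node {
	if c == nil || c[n] {
		return n
	}
	out := &node{items: append([]int(nil), n.items...)}
	c[out] = true
	return out
}

// add has the same body for both modes; only the set passed in differs.
func (n *node) add(v int, c copyOnWriteSet) *node {
	w := c.writable(n)
	w.items = append(w.items, v)
	return w
}

func main() {
	n := &node{items: []int{1}}

	same := n.add(2, nil)               // ephemeral: mutates n in place
	fresh := n.add(3, copyOnWriteSet{}) // persistent: mutates a copy of n

	fmt.Println(same == n, fresh == n)
	fmt.Println(n.items, fresh.items)
}
```

The extra `writable` call in the ephemeral path is the kind of overhead the benchmark paragraph above attributes the slowdown to.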
API proposal:
type ImmutableBTree struct {
// NewImmutable creates an empty tree with given degree.
// ImmutableBTree has all the read-only methods that BTree has like Get().
type Builder struct {
// NewBuilder creates a new builder starting from tr.
// Builder has all the methods that BTree has including all the read methods plus...
// Set sets this builder to tr giving this builder the same degree as tr.
// Build builds a tree with the same items and degree as this builder. |
Why not just do this: type ImmutableBTree interface {
... all read methods of btree...
}
To make IBT:
func MakeImmutableBTree() ImmutableBTree {
b := NewBTree()
... add stuff ...
return ImmutableBTree(b)
} |
Likely btree will get new read methods. If we add new methods to the On Friday, 7 October 2016, Graeme Connell [email protected] wrote:
|
I'm happy to have the interface as part of the main package, if you want. On Fri, Oct 7, 2016 at 9:49 AM, keep94 [email protected] wrote:
|
Regarding your proposal of adding the interface, what becomes of the On Friday, 7 October 2016, Graeme Connell [email protected] wrote:
|
If I understand correctly, you suggest I wrap a normal btree instance in an With the interface I'd still have to do a full defensive copy of the On Friday, 7 October 2016, Travis Keep [email protected] wrote:
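The concern about wrapping a normal btree in a read-only interface can be illustrated with a small sketch. The `BTree` type here is a hypothetical map-backed stand-in, not the real google/btree type; the point is only that an interface value still aliases the mutable tree unless a defensive copy is made.

```go
package main

import "fmt"

// BTree is a hypothetical stand-in for a mutable tree (map-backed for brevity).
type BTree struct{ items map[int]bool }

func NewBTree() *BTree          { return &BTree{items: map[int]bool{}} }
func (b *BTree) Add(k int)      { b.items[k] = true }
func (b *BTree) Has(k int) bool { return b.items[k] }

// ImmutableBTree exposes only the read methods.
type ImmutableBTree interface{ Has(k int) bool }

func main() {
	b := NewBTree()
	b.Add(1)

	var view ImmutableBTree = b // no copy is made here

	b.Add(2)                 // mutation through the original pointer
	fmt.Println(view.Has(2)) // the "immutable" view observes the change

	// A type assertion also recovers the mutable type.
	if m, ok := view.(*BTree); ok {
		m.Add(3)
	}
	fmt.Println(view.Has(3))
}
```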
|
If the extra API load of having the builder concerns you, I did think of a way that we can safely fold the Build and Set methods into BTree and do away with the Builder class completely, but the trade-off is that the Build method will run in sub-linear time, average case, instead of constant time.
If we change the Build method so that it does deep copies of nodes that are reachable only by the BTree and shallow copies of shared nodes, then Build becomes a truly read-only method and can be safely folded into BTree without confusion. The downside is that the Build method becomes much slower. For a BTree built from scratch, where all nodes are reachable only from the btree, Build would run in O(n) time and be exactly the same as a deep copy. Even with this modified Build method, the BTree would still have to do copy-on-write when changing shared nodes, or else it would mutate other ImmutableBTrees.
So, in conclusion, folding Build and Set into the BTree class itself would reduce API load by getting rid of the Builder class, but getting a modified copy of an ImmutableBTree would take twice as long as with a builder class, best case, since in addition to doing copy-on-write as modifications happen, the Build method would have to do defensive copying of the same modified nodes yet again. In the worst case, Build would take O(n) time. |
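One way to picture the modified Build method is the sketch below. It assumes a hypothetical `frozen` flag that marks nodes already shared with an immutable tree; neither the flag nor the `build` function comes from the proposal or the library. Nodes reachable only from the mutable tree are copied, shared nodes are reused as they are, so the cost is proportional to the number of unshared nodes.

```go
package main

import "fmt"

// node is a simplified stand-in. The frozen flag is an assumption made for
// this sketch: it marks nodes that already belong to an immutable tree.
type node struct {
	items    []int
	children []*node
	frozen   bool
}

// build returns a subtree that is safe to hand out as immutable. Frozen
// (shared) nodes are reused; all other nodes are deep copied and the copies
// are marked frozen. The receiver tree is never modified.
func build(n *node) *node {
	if n == nil || n.frozen {
		return n
	}
	out := &node{items: append([]int(nil), n.items...), frozen: true}
	for _, c := range n.children {
		out.children = append(out.children, build(c))
	}
	return out
}

func main() {
	shared := &node{items: []int{1}, frozen: true} // already part of an immutable tree
	private := &node{items: []int{9}}              // reachable only from this tree
	root := &node{items: []int{5}, children: []*node{shared, private}}

	snap := build(root)
	private.items[0] = 100 // later in-place change to the mutable tree

	fmt.Println(snap.children[0] == shared)  // shared node reused
	fmt.Println(snap.children[1] == private) // private node was copied
	fmt.Println(snap.children[1].items)      // snapshot unaffected by the change
}
```

For a tree built from scratch no node is frozen, so every node is copied, which matches the O(n) worst case described above.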
If it sounds good to you, I am willing to get rid of the Builder class. The API would look like this. Let me know what you think.
type BTree struct {
// all the usual methods plus
// Build builds an immutable version of this tree
// Set changes this instance to have the same items and degree as tree.
type ImmutableBTree struct {
// ImmutableBTree has all the read-only methods of BTree. |
Is there any plan to implement persistent btrees? |
I implemented a "semi-persistent" in-memory B+ tree: It can be used in a mutable or immutable fashion (the immutable fashion is |
any update? |
This btree data structure is an ephemeral data structure. Changes to a particular btree instance are destructive. It would be nice if there were a persistent, immutable counterpart to this data structure. Adding to or deleting from the persistent version of btree creates a new btree and leaves the original intact. We could have the ability to batch together mutations to cut down on the creation of intermediate instances.
A persistent version of btree would make transactional processing very easy, as one could quickly revert to an older version. A persistent, immutable btree helps with concurrency too. One goroutine could read the btree while another is modifying it. A persistent btree helps with undo / redo operations too.
I figure the persistent version would be almost like the ephemeral version except that, instead of mutating nodes in place, we do copy-on-write. Doing a single mutation to a persistent btree, one add or one delete, would be cheap. The old and new versions would share many of the same nodes. Only O(log n) nodes would be different.
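The node sharing between versions can be sketched with path copying. For brevity this illustration uses a plain binary search tree rather than a B-tree, and the `tree`, `insert`, and `contains` names are hypothetical; the same idea applies per node in a btree, where only the nodes on the path from the root to the changed leaf are copied.

```go
package main

import "fmt"

// tree is a persistent binary search tree node (a simplified stand-in for
// a btree node, used only to show structural sharing).
type tree struct {
	key         int
	left, right *tree
}

// insert returns the root of a new version. Only the nodes on the search
// path are copied; every other subtree is shared with the old version.
func insert(t *tree, key int) *tree {
	if t == nil {
		return &tree{key: key}
	}
	cp := *t // shallow copy of this node only
	switch {
	case key < t.key:
		cp.left = insert(t.left, key)
	case key > t.key:
		cp.right = insert(t.right, key)
	}
	return &cp
}

// contains reports whether key is present in the version rooted at t.
func contains(t *tree, key int) bool {
	for t != nil {
		switch {
		case key < t.key:
			t = t.left
		case key > t.key:
			t = t.right
		default:
			return true
		}
	}
	return false
}

func main() {
	var v1 *tree
	for _, k := range []int{50, 30, 70, 20, 40, 60, 80} {
		v1 = insert(v1, k)
	}
	v2 := insert(v1, 10) // new version; v1 stays valid

	fmt.Println(v1.right == v2.right)               // untouched subtree is shared
	fmt.Println(v1.left == v2.left)                 // node on the path was copied
	fmt.Println(contains(v1, 10), contains(v2, 10)) // old version is intact
}
```

Reverting to an older version, as in the transactional and undo cases, then amounts to keeping the old root pointer around.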
With the persistent version of btree, there would be no free list or recycling of nodes. Each node in the btree would be immutable as it could be shared by many different versions of the same btree.
If this sounds interesting to you, I'd be willing to take on the work.