consider caching expensive flywire id lookups #67
Have taken a look at this. It makes sense to compress the data that is cached. Switching from character to compressed int64 reduces space by ~30x (10x for int64, 3x for compression). gzip is the best general-purpose compression: it adds only about a 5% penalty when there is a cache miss and is very fast to decompress. I found that brotli was actually a small improvement, but only with a non-default quality=2; this took very slightly longer to decompress but compressed faster and smaller, and on balance I think that makes it the winner. The brotli package is maintained by Jeroen Ooms, so although it is another dependency, I eventually opted to add it to Suggests and make it the default when available.

For the data compression step specifically, the cache read time was ~10 ms for gzip vs ~15 ms for brotli(q=2), which was in any case dwarfed by the character conversion time (~75 ms) when character output was needed.
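A minimal sketch of the compress-on-write / decompress-on-read idea (helper names are illustrative, not the package's actual internals), assuming the bit64 and brotli packages are installed:

```r
library(bit64)

compress_ids <- function(ids, use_brotli = requireNamespace("brotli", quietly = TRUE)) {
  # character flywire ids -> int64 (~10x smaller) -> compressed bytes (~3x more)
  bytes <- writeBin(unclass(as.integer64(ids)), raw())
  if (use_brotli) brotli::brotli_compress(bytes, quality = 2)
  else memCompress(bytes, type = "gzip")
}

decompress_ids <- function(bytes, use_brotli = TRUE, as_character = TRUE) {
  # real code would record at write time which codec was used
  buf <- if (use_brotli) brotli::brotli_decompress(bytes)
         else memDecompress(bytes, type = "gzip")
  ids <- structure(readBin(buf, "double", n = length(buf) %/% 8L),
                   class = "integer64")
  # the ~75 ms character conversion only pays when callers need strings
  if (as_character) as.character(ids) else ids
}
```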
The current default is 5000 items in the lru cache. I wonder if it would be worth making it smaller (1000?), while adding an option that could make it larger for expert use when there is plenty of memory available.
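A hedged sketch of exposing that as an option (the `fafbseg.cachesize` option name is hypothetical):

```r
# smaller default, with an escape hatch for users with plenty of memory
cache_size <- getOption("fafbseg.cachesize", default = 1000L)
id_cache <- cachem::cache_mem(max_n = cache_size, evict = "lru")
```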
Caching in action (screenshot omitted).
Should maybe consider caching to disk. Likely still very fast with a good backend, and probably quite useful since these mappings are permanent. Also, it may be better not to risk filling memory. cachem may be a good option now since it is the default backend for memoise (although I'm not 100% certain it goes back to R 3.5).
cachem installs ok on linux under R 3.5, and the cache_layered option seems like it would be a nice fit: when keys are set, they are set in all layers (i.e. on disk as well as in memory), so this should give a persistent store on disk as well as a fast store in memory.
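A sketch of that layered setup (the choice of cache directory via rappdirs is my assumption):

```r
library(cachem)
cl <- cache_layered(
  cache_mem(max_n = 1000),                               # fast in-memory layer
  cache_disk(dir = rappdirs::user_cache_dir("fafbseg"))  # persistent layer
)
cl$set("somekey", 1:10)  # set() writes to every layer, so the disk copy persists
cl$get("somekey")        # get() is served from memory while the entry is warm
```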
So far we have not been caching supervoxel id to root id lookups since they can change. However, we could do this along the lines of the sketch below:
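A minimal sketch of that idea (names and control flow are my assumptions, not the issue's elided proposal): cache svid -> rootid pairs, but verify cached root ids with fafbseg's flywire_islatest() before trusting them, and re-fetch any that have been invalidated.

```r
library(cachem)
svid_cache <- cache_mem(max_n = 5000)

flywire_svids2roots_cached <- function(svids) {
  # 1. try the cache
  roots <- vapply(svids, function(s) {
    hit <- svid_cache$get(s)
    if (is.key_missing(hit)) NA_character_ else hit
  }, character(1), USE.NAMES = FALSE)

  # 2. cached root ids may have been superseded by proofreading edits:
  #    treat stale entries as cache misses
  cached <- !is.na(roots)
  if (any(cached)) {
    stale <- !fafbseg::flywire_islatest(roots[cached])
    roots[cached][stale] <- NA_character_
  }

  # 3. look up the misses and refresh the cache
  miss <- is.na(roots)
  if (any(miss)) {
    fresh <- fafbseg::flywire_rootid(svids[miss])
    mapply(svid_cache$set, svids[miss], fresh)
    roots[miss] <- fresh
  }
  roots
}
```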
Performance considerations:
This feels like a situation where the standard memoise approach isn't going to work and we are going to need a standard database or a specialised key/value backend. Looking through the R ecosystem, these are the options I can find:
It might make more sense to update invalidated rootids with a signalling value of 0 rather than drop them, since dropping might cause a lot of churn, and there is a strong chance that we will want to update all the supervoxel ids for an invalidated root id anyway.
That leaves redux/redis, which looks somewhat involved, and thor, which is potentially simpler, though it is unclear whether LMDB supports the kind of reciprocal key/value lookups that we need. All told, I wonder if a doubly indexed sqlite table might be the simplest option for this use case. The svids should be unique, so they could be the primary key of the table.
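A sketch of that doubly indexed table (the schema is my guess at the suggestion; ids are stored as TEXT to sidestep 64-bit integer handling in R). Requires DBI and RSQLite:

```r
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "svid-cache.sqlite")

# svid is unique -> primary key; the extra index makes the reverse
# rootid -> svids lookup cheap
dbExecute(con, "
  CREATE TABLE IF NOT EXISTS svid2root (
    svid   TEXT PRIMARY KEY,
    rootid TEXT NOT NULL      -- '0' signals an invalidated mapping
  )")
dbExecute(con, "CREATE INDEX IF NOT EXISTS svid2root_rootid ON svid2root (rootid)")

# invalidate with the 0 signalling value rather than deleting rows
old_rootid <- "720575940611237246"  # hypothetical edited root id
dbExecute(con, "UPDATE svid2root SET rootid = '0' WHERE rootid = ?",
          params = list(old_rootid))
```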
rootid -> supervoxels

- This can be cached essentially permanently, since rootids mutate whenever the underlying mapping changes.
- However, the results are quite large, so the object store could quickly start taking a lot of space. For example, one fairly large neuron generates ~270,000 svids occupying ~20 MB as a character vector. Compressing the binary data reduces this to ~0.7 MB (270,000 × 8 bytes ≈ 2.2 MB as int64, and ~3x compression brings that down to ~0.7 MB).
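To make those numbers concrete, a quick size check; flywire_leaves() is fafbseg's rootid -> supervoxel lookup, while `rootid` and the compress_ids() helper from the earlier sketch are illustrative:

```r
svids <- fafbseg::flywire_leaves(rootid)   # rootid: a large neuron's root id
length(svids)                              # ~270,000 supervoxel ids
format(object.size(svids), units = "Mb")   # ~20 MB as a character vector
length(compress_ids(svids)) / 2^20         # ~0.7 MB after int64 + brotli
```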