consider caching expensive flywire id lookups #67
Have taken a look at this. It makes sense to compress the data that is cached. Switching from character to compressed int64 reduces space by ~30x (10x for int64, 3x for compression). gzip is the best general-purpose compression: it adds only about a 5% penalty when there is a cache miss and is very fast to decompress. I found that brotli was actually a small improvement, but only with a non-default quality=2; this took very slightly longer to decompress but compressed faster and smaller, and on balance I think that makes it the winner. The brotli package is maintained by Jeroen Ooms, so although it is another dependency, I eventually opted to add it to Suggests and make it the default when available.

For the data compression step specifically, the cache read time was ~10 ms for gzip vs ~15 ms for brotli(q=2), which was in any case dwarfed by the character conversion time (~75 ms) when character output was needed.
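A minimal sketch of the compress-on-write / decompress-on-read idea (helper names are illustrative, not the package's actual internals), assuming the bit64 and brotli packages are installed:

```r
library(bit64)

compress_ids <- function(ids, use_brotli = requireNamespace("brotli", quietly = TRUE)) {
  # character flywire ids -> int64 (~10x smaller) -> compressed bytes (~3x more)
  bytes <- writeBin(unclass(as.integer64(ids)), raw())
  if (use_brotli) brotli::brotli_compress(bytes, quality = 2)
  else memCompress(bytes, type = "gzip")
}

decompress_ids <- function(bytes, use_brotli = TRUE, as_character = TRUE) {
  # real code would record at write time which codec was used
  buf <- if (use_brotli) brotli::brotli_decompress(bytes)
         else memDecompress(bytes, type = "gzip")
  ids <- structure(readBin(buf, "double", n = length(buf) %/% 8L),
                   class = "integer64")
  # the ~75 ms character conversion only pays when callers need strings
  if (as_character) as.character(ids) else ids
}
```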
The current default is 5000 items in the lru cache. I wonder if it would be worth making it smaller (1000?), while adding an option that could make it larger for expert use when there is plenty of memory available.
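A hedged sketch of exposing that as an option (the `fafbseg.cachesize` option name is hypothetical):

```r
# smaller default, with an escape hatch for users with plenty of memory
cache_size <- getOption("fafbseg.cachesize", default = 1000L)
id_cache <- cachem::cache_mem(max_n = cache_size, evict = "lru")
```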
Caching in action (screenshot omitted).
Should maybe consider caching to disk. Likely still very fast with a good backend, and probably quite useful since these mappings are permanent. Also, it may be better not to risk filling memory. cachem may be a good option now since it is the default backend for memoise (although I'm not 100% certain it goes back to R 3.5).
cachem installs ok on linux under R 3.5, and the cache_layered option seems like it would be a nice fit: when keys are set, they are set in all layers (i.e. on disk as well as in memory), so this should give a persistent store on disk as well as a fast store in memory.
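A sketch of that layered setup (the choice of cache directory via rappdirs is my assumption):

```r
library(cachem)
cl <- cache_layered(
  cache_mem(max_n = 1000),                               # fast in-memory layer
  cache_disk(dir = rappdirs::user_cache_dir("fafbseg"))  # persistent layer
)
cl$set("somekey", 1:10)  # set() writes to every layer, so the disk copy persists
cl$get("somekey")        # get() is served from memory while the entry is warm
```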
So far we have not been caching supervoxel id to root id lookups since they can change. However, we could do this along the lines of the sketch below:
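A minimal sketch of that idea (names and control flow are my assumptions, not the issue's elided proposal): cache svid -> rootid pairs, but verify cached root ids with fafbseg's flywire_islatest() before trusting them, and re-fetch any that have been invalidated.

```r
library(cachem)
svid_cache <- cache_mem(max_n = 5000)

flywire_svids2roots_cached <- function(svids) {
  # 1. try the cache
  roots <- vapply(svids, function(s) {
    hit <- svid_cache$get(s)
    if (is.key_missing(hit)) NA_character_ else hit
  }, character(1), USE.NAMES = FALSE)

  # 2. cached root ids may have been superseded by proofreading edits:
  #    treat stale entries as cache misses
  cached <- !is.na(roots)
  if (any(cached)) {
    stale <- !fafbseg::flywire_islatest(roots[cached])
    roots[cached][stale] <- NA_character_
  }

  # 3. look up the misses and refresh the cache
  miss <- is.na(roots)
  if (any(miss)) {
    fresh <- fafbseg::flywire_rootid(svids[miss])
    mapply(svid_cache$set, svids[miss], fresh)
    roots[miss] <- fresh
  }
  roots
}
```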
Performance considerations:
This feels like a situation where the standard memoise approach isn't going to work and we are going to need a standard database or a specialised key/value backend. Looking through the R ecosystem, these are the options I can find:
It might make more sense to update invalidated rootids with a signalling value of 0 rather than drop them, since dropping might cause a lot of churn, and there is a strong chance that we will want to update all the supervoxel ids for an invalidated root id anyway.
That leaves redux/redis, which looks somewhat involved, and thor, which is potentially simpler, though it is unclear whether LMDB supports the kind of reciprocal key/value lookups that we need. All told, I wonder if a doubly indexed sqlite table might be the simplest option for this use case. The svids should be unique, so they could be the primary key of the table.
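A sketch of that doubly indexed table (the schema is my guess at the suggestion; ids are stored as TEXT to sidestep 64-bit integer handling in R). Requires DBI and RSQLite:

```r
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "svid-cache.sqlite")

# svid is unique -> primary key; the extra index makes the reverse
# rootid -> svids lookup cheap
dbExecute(con, "
  CREATE TABLE IF NOT EXISTS svid2root (
    svid   TEXT PRIMARY KEY,
    rootid TEXT NOT NULL      -- '0' signals an invalidated mapping
  )")
dbExecute(con, "CREATE INDEX IF NOT EXISTS svid2root_rootid ON svid2root (rootid)")

# invalidate with the 0 signalling value rather than deleting rows
old_rootid <- "720575940611237246"  # hypothetical edited root id
dbExecute(con, "UPDATE svid2root SET rootid = '0' WHERE rootid = ?",
          params = list(old_rootid))
```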
rootid -> supervoxels

- This can be cached essentially permanently, since rootids mutate whenever the underlying mapping changes.
- However, the results are quite large, so the object store could quickly start taking a lot of space. For example, one fairly large neuron generates ~270,000 svids occupying ~20 MB as a character vector. Compressing the binary data reduces this to ~0.7 MB (270,000 × 8 bytes ≈ 2.2 MB as int64, and ~3x compression brings that down to ~0.7 MB).
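To make those numbers concrete, a quick size check; flywire_leaves() is fafbseg's rootid -> supervoxel lookup, while `rootid` and the compress_ids() helper from the earlier sketch are illustrative:

```r
svids <- fafbseg::flywire_leaves(rootid)   # rootid: a large neuron's root id
length(svids)                              # ~270,000 supervoxel ids
format(object.size(svids), units = "Mb")   # ~20 MB as a character vector
length(compress_ids(svids)) / 2^20         # ~0.7 MB after int64 + brotli
```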