# NBLAST for large datasets

Experimental version of navis' NBLAST containing a few tweaks to enable blasting of very large (upwards of 100k) sets of neurons.

If you need to NBLAST only a few thousand neurons, you are probably better off with the implementation in navis!

At this point we offer two variants:

## 1. NBLAST to disk

This implementation immediately writes scores to disk. Notably, we:

  1. Use Zarr to create an on-disk array which is then populated in parallel by the spawned NBLAST processes.
  2. Use Dask to wrap the Zarr array as an on-disk DataFrame.
  3. Make some (minimal) effort to balance the load across the processes.

```python
>>> from nblast import nblast_disk
>>> # `queries`/`targets` are navis.Dotprops
>>> scores = nblast_disk(queries, targets,
...                      out='./nblast_results',  # where to store the Zarr array with results
...                      return_frame=True)       # default is `return_frame=False`
>>> # Convert the Dask DataFrame to an in-memory pandas DataFrame
>>> scores = scores.compute()
```

The scores are stored as a Zarr array in `./nblast_results/scores`. To re-open it:

```python
>>> import zarr
>>> # Open the array
>>> z = zarr.open_array("./nblast_results/scores", mode="r")
>>> # IDs of queries/targets are stored as attributes
>>> z.attrs['queries']
[5812980561, 5813056261, 1846312261, 2373845661, ...]
```
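
If you want to keep working out-of-core, you can wrap the re-opened array back into a Dask DataFrame. A minimal sketch, assuming the target IDs are stored under `z.attrs['targets']` analogous to the queries:

```python
>>> import dask.array as da
>>> import dask.dataframe as dd
>>> # Wrap the on-disk Zarr array without loading it into memory
>>> arr = da.from_zarr("./nblast_results/scores")
>>> # Rebuild the scores frame; columns are the target IDs
>>> scores = dd.from_dask_array(arr, columns=z.attrs['targets'])
```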

### Why Zarr for the on-disk array?

Because Zarr has a very easy-to-use interface for coordinating parallel writes.
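
To illustrate the pattern (a minimal, self-contained sketch, not this package's actual code): several worker processes write disjoint, chunk-aligned row blocks of a shared Zarr array without any explicit locking.

```python
import zarr
from concurrent.futures import ProcessPoolExecutor

def fill_block(start):
    # Each worker re-opens the array and writes its own chunk-aligned
    # block of rows, so no two processes ever touch the same chunk
    z = zarr.open_array('./demo_scores', mode='r+')
    z[start:start + 100, :] = 1.0  # stand-in for a block of NBLAST scores
    return start

if __name__ == '__main__':
    # Chunks span whole rows, so writes along axis 0 never overlap
    zarr.open_array('./demo_scores', mode='w', shape=(1000, 1000),
                    chunks=(100, 1000), dtype='f4')
    with ProcessPoolExecutor() as pool:
        list(pool.map(fill_block, range(0, 1000, 100)))
```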

### Is this slower than the original navis implementation?

Not as far as I can tell: the overhead from using Zarr/Dask appears surprisingly low. It does, however, introduce new dependencies, and the returned Dask DataFrame is more difficult to handle than a run-of-the-mill pandas DataFrame.
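
For example, operations on the Dask DataFrame are lazy and only touch the chunks they need; to actually get values you have to call `.compute()`. A hypothetical lookup (the target ID here is made up):

```python
>>> # Pull all scores against a single target into memory as a pandas Series
>>> scores[5813056261].compute()
```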

## 2. NBLAST to sparse matrix

This implementation drops scores below a given threshold and stores results in a sparse matrix. This is useful if you don't care about low (say less than 0) scores.

```python
>>> from nblast import nblast_sparse
>>> # `queries`/`targets` are navis.Dotprops
>>> scores = nblast_sparse(queries, targets, threshold=0)
>>> scores.dtypes
603270961     Sparse[float32, nan]
2312333561    Sparse[float32, nan]
1934410061    Sparse[float32, nan]
...
>>> scores.sparse.density
0.05122
```

`scores` is a pandas DataFrame where each column is a sparse array. It will, in many ways, handle like a normal DataFrame. See the pandas docs on sparse data structures for details. As things stand, you will have to convert the DataFrame to dense (`scores.sparse.to_dense()`) in order to save it to disk (e.g. as feather or parquet) unless you want to use pickle.
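
That last step might look like this (a sketch; assumes `pyarrow` is installed for the parquet writer):

```python
>>> # Densify first (the parquet/feather writers don't accept sparse
>>> # columns), then write to disk
>>> scores.sparse.to_dense().to_parquet('./nblast_scores.parquet')
```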