
(fix): use dask array for missing element in dask concatenation #1780

Open · wants to merge 9 commits into main
Conversation


@ilan-gold ilan-gold commented Nov 27, 2024

To test:

from pathlib import Path
import time
import anndata as ad
import numpy as np
import zarr
import h5py

def read_as_dask(store: str) -> ad.AnnData:
    """\
    Read from a hierarchical Zarr array store.

    Parameters
    ----------
    store
        The filename, a :class:`~typing.MutableMapping`, or a Zarr storage class.
    """
    if not isinstance(store, str):
       raise ValueError("Only string paths are supported")

    if store.endswith(".h5ad"):
        f = h5py.File(store, "r")
    elif store.endswith(".zarr"):
        f = zarr.open(store, mode="r")
    else:
        raise ValueError("Unknown file format")

    # Read with handling for backwards compat
    def callback(func, elem_name: str, elem, iospec):
        if iospec.encoding_type == "anndata" or elem_name.endswith("/"):
            return ad.AnnData(
                **{
                    k: ad.experimental.read_dispatched(v, callback)
                    for k, v in dict(elem).items()
                    if not k.startswith("raw.")
                }
            )
        elif elem_name.startswith("/raw"):  # skip raw here; adjust to your needs, but beware of missing elements
            return None
        elif iospec.encoding_type in {
            "csr_matrix",
            "csc_matrix",
            "array",
        }:
            return ad.experimental.read_elem_as_dask(elem)
        elif iospec.encoding_type == "dict":
            return {k: ad.experimental.read_dispatched(v, callback=callback) for k, v in elem.items()}
        return ad.io.read_elem(elem)

    adata = ad.experimental.read_dispatched(f, callback=callback)

    return adata

shape = (100_000, 10_000)
n_datasets = 2
layer_key = "foo"
def gen_path(i: int):
    return f"data/test_{i}.zarr"
arr = None
for i in range(n_datasets):
    file_path = Path(gen_path(i))
    if not file_path.exists():
        if arr is None:
            arr = np.random.random(shape)
        adata = ad.AnnData(X=arr)
        if i == 0:
            adata.layers[layer_key] = arr
        adata.write_zarr(file_path)
adatas = [read_as_dask(gen_path(i)) for i in range(n_datasets)]
assert sum(layer_key in a.layers for a in adatas) == 1

t = time.time()
concatenated = ad.concat(adatas, join="outer")
print('Concatenation took: ', time.time() - t)

On main this takes about 30 seconds; with this PR it takes about 0.3 seconds.

With the Python profiler you can see where the performance hit was coming from: by passing in numpy arrays as missing values instead of dask arrays, we were triggering dask's tokenization mechanism for an in-memory data structure:

       56    0.000    0.000   34.738    0.620 tokenize.py:47(tokenize)
       56    0.001    0.000   34.737    0.620 tokenize.py:33(_tokenize)
   124/56    0.001    0.000   34.734    0.620 tokenize.py:141(_normalize_seq_func)
  338/138    0.000    0.000   34.734    0.252 tokenize.py:142(_inner_normalize_token)
   139/77    0.001    0.000   34.733    0.451 utils.py:767(__call__)
       10    0.000    0.000   34.725    3.472 core.py:4694(asarray)
        4    0.738    0.184   34.699    8.675 tokenize.py:401(normalize_array)
      208    0.000    0.000   27.179    0.131 hashing.py:94(hash_buffer_hex)
      208    0.000    0.000   27.178    0.131 hashing.py:73(hash_buffer)
      208    0.000    0.000   27.178    0.131 hashing.py:63(_hash_sha1)
      209   27.176    0.130   27.176    0.130 {built-in method _hashlib.openssl_sha1}
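To see why the numpy path is so much slower, here is a small standalone sketch (plain numpy plus `hashlib`, no anndata or dask required; the shapes and the metadata tuple are illustrative, not the actual dask internals). Tokenizing an in-memory numpy array forces dask to hash the full data buffer (the `openssl_sha1` line dominating the profile above), while a lazily-created dask array can be tokenized from constant-size metadata alone:

```python
import hashlib
import time

import numpy as np

# An in-memory array of roughly one chunk's worth of the test data (~80 MB).
arr = np.random.random((1_000, 10_000))

# Tokenizing a numpy array means hashing every byte of its buffer:
t = time.time()
content_token = hashlib.sha1(arr.tobytes()).hexdigest()
buffer_time = time.time() - t

# A dask placeholder array's token can instead be derived from metadata
# (shape, dtype, chunking), so the cost is independent of the data size:
t = time.time()
meta_token = hashlib.sha1(repr((arr.shape, arr.dtype, "chunks")).encode()).hexdigest()
meta_time = time.time() - t

print(f"buffer hash: {buffer_time:.4f}s, metadata hash: {meta_time:.6f}s")
```

Scaled up to the full 100,000 × 10,000 missing layer, the buffer-hashing cost is what accounts for the ~30 seconds spent in `normalize_array` above.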


codecov bot commented Nov 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.53%. Comparing base (7d9fba8) to head (3979acf).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1780      +/-   ##
==========================================
- Coverage   87.01%   84.53%   -2.48%     
==========================================
  Files          40       40              
  Lines        6075     6080       +5     
==========================================
- Hits         5286     5140     -146     
- Misses        789      940     +151     
Files with missing lines Coverage Δ
src/anndata/_core/merge.py 84.04% <100.00%> (-10.94%) ⬇️

... and 7 files with indirect coverage changes

@ilan-gold ilan-gold added this to the 0.11.2 milestone Nov 28, 2024
@ilan-gold ilan-gold changed the title (fix): use dask array for missing element in concatenation (fix): use dask array for missing element in dask concatenation Nov 28, 2024

Successfully merging this pull request may close these issues.

Dask Concatenation Should Impute With Dask Array, not Numpy