Add support for sparse arrays inside dask arrays #1114
Conversation
Codecov Report
Additional details and impacted files:

@@            Coverage Diff             @@
##              main    #1114       +/- ##
===========================================
- Coverage    84.97%   52.43%   -32.55%
===========================================
  Files           36       36
  Lines         5197     5287      +90
===========================================
- Hits          4416     2772    -1644
- Misses         781     2515    +1734

Flags with carried forward coverage won't be shown.
anndata/_core/anndata.py
Outdated
X=_safe_transpose(X) if X is not None else None,
# X=t_csr(X) if X is not None else None,
@flying-sheep do you remember why we would convert to csr here? It seems like an odd choice considering transpose is "free" if we don't insist on the result type.
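A quick illustration of the point above: for scipy sparse matrices, transposing is essentially free if you accept the flipped format, while insisting on CSR (as the old t_csr did) forces a conversion pass:

from scipy import sparse

x = sparse.random(1000, 100, format="csr", density=0.01)

xt = x.T              # csc_matrix reusing the same data buffers, no copy
xt_csr = x.T.tocsr()  # the old t_csr-style behavior: an extra conversion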
I think older versions of scanpy or anndata assumed csr, but I don’t really know.
I guess we should run scanpy’s test suite on this branch before merging.
It seems fine... But part of me feels like we should do a deprecation.
Tbh I think transposing is pretty rare. Maybe we change it with a release candidate and if anyone complains we do a deprecation?
@selmanozleyen pinging you for a look at this. It's pretty close to done I think, just 3 failing tests left. WDYT?
anndata/tests/helpers.py
Outdated
assert_equal(b, a.compute(), exact, elem_name)
# TODO: Figure out why we did this
# from dask.array.utils import assert_eq
#
# I believe the above fails for sparse matrices due to some coercion to np.matrix
# if exact:
#     assert_eq(a, b, check_dtype=True, check_type=True, check_graph=False)
# else:
#     # TODO: Why does it fail when check_graph=True
#     assert_eq(a, b, check_dtype=False, check_type=False, check_graph=False)
I remember not wanting to use compute for the equality check, to see whether the lazy version works fine with the assertion. However, it also checked whether the computation graphs were the same, which turned out to fail. Since lazy equality checking should really be dask's feature, I think we could write an assert_eq function for two dask arrays and compute when the other operand isn't a dask array.
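A rough sketch of that idea (hypothetical helper, not part of this PR): compare lazily when both sides are dask arrays, otherwise compute the dask side first:

import dask.array as da
import numpy as np
from dask.array.utils import assert_eq

def assert_dask_equal(a, b, exact: bool = True) -> None:
    if isinstance(a, da.Array) and isinstance(b, da.Array):
        # check_graph=False: equal values shouldn't require identical graphs
        assert_eq(a, b, check_dtype=exact, check_graph=False)
    else:
        # Compute whichever side is lazy, densify sparse results, compare
        a = a.compute() if isinstance(a, da.Array) else a
        b = b.compute() if isinstance(b, da.Array) else b
        a = a.toarray() if hasattr(a, "toarray") else a
        b = b.toarray() if hasattr(b, "toarray") else b
        np.testing.assert_array_equal(a, b)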
I think dask's assert_eq always computes the values. To me, this makes sense for a test assertion, since it's okay if it's a bit slow.
assert_eq definition: https://github.com/dask/dask/blob/b015ea8bf87f4c525379719f8a1cf2851c55bb58/dask/array/utils.py#L284
_get_dt_meta_computed inner function definition: https://github.com/dask/dask/blob/b015ea8bf87f4c525379719f8a1cf2851c55bb58/dask/array/utils.py#L245
Graph-based comparisons are probably a better fit for anndata._core.merge.equals.
Very exciting!
I'm not really familiar with the transpose code, so I glossed over it. I wrote some comments on parts of the code that I think would be good to clarify (e.g. the indexing and equal functions).
Some new warnings show up in the tests; capturing or, ideally, fixing them would be important imo.
# TODO: this may have been working for some cases?
subset_idx = np.ix_(*subset_idx)
return a.vindex[subset_idx]
Works for some sparse cases? I don't understand the problem here.
Ideally the vindex case would be more efficient by indexing across both dimensions at the same time, though I haven't actually checked this. But I believe some test cases that were passing without this change are now going through the new branches. It could be worth making those cases go through vindex instead of [obs_idx, :][:, var_idx] (see the sketch below).
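For concreteness, a small sketch of the two strategies (dense chunks here, and the indices are made up):

import dask.array as da
import numpy as np

x = da.random.random((1000, 200), chunks=(100, 50))
obs_idx, var_idx = np.array([1, 5, 7]), np.array([0, 2, 4])

chained = x[obs_idx, :][:, var_idx]            # two orthogonal indexing passes
vindexed = x.vindex[np.ix_(obs_idx, var_idx)]  # both axes in one vindex pass

assert np.allclose(chained.compute(), vindexed.compute())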
if isinstance(a._meta, spmatrix):
    # TODO: Maybe also do this in the other case?
    return da.map_blocks(equal, a, b, drop_axis=(0, 1)).all()
Any specific reason not to use np.equal? Wouldn't it check the type for each block even though the block type is the same for all of them?
np.equal doesn't work on sparse arrays:
In [3]: X = sparse.random(
   ...:     1000,
   ...:     100,
   ...:     format="csr",
   ...:     density=0.01,
   ...:     random_state=np.random.default_rng(),
   ...: )
In [4]: np.equal(X, X)
/usr/local/lib/python3.9/site-packages/IPython/core/interactiveshell.py:3508: SparseEfficiencyWarning: Comparing sparse matrices using == is inefficient, try using != instead.
exec(code_obj, self.user_global_ns, self.user_ns)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[4], line 1
----> 1 np.equal(X, X)
File /usr/local/lib/python3.9/site-packages/scipy/sparse/_base.py:332, in _spbase.__bool__(self)
330 return self.nnz != 0
331 else:
--> 332 raise ValueError("The truth value of an array with more than one "
333 "element is ambiguous. Use a.any() or a.all().")
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
Plus we want the nan handling behavior here.
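For context, a minimal sketch of a sparse equality check with that NaN handling (assuming CSR inputs; not anndata's actual implementation):

import numpy as np
from scipy import sparse

def sparse_equal_nan(a, b) -> bool:
    # Treat NaN == NaN as equal, which plain elementwise comparison doesn't
    if a.shape != b.shape:
        return False
    diff = a != b  # != is the efficient comparison direction for sparse
    if diff.nnz == 0:
        return True
    # Remaining mismatches only count as equal if both sides are NaN there
    rows, cols = diff.nonzero()
    a_vals = np.asarray(a[rows, cols]).ravel()
    b_vals = np.asarray(b[rows, cols]).ravel()
    return bool(np.all(np.isnan(a_vals) & np.isnan(b_vals)))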
True, sorry. I guess I made a mistake, because it somehow passed when I wrote it, but now it fails. Then it would make sense if that were just the else branch. However, it fails for me in both directions in this case:
def test_sparse_vs_dense_dask():
    data = np.random.random((100, 100))
    x = as_sparse_dask_array(data)
    y = as_dense_dask_array(data)
    assert equal(y, x)
    assert equal(x, y)
It looks like that is also true for in-memory dense and sparse arrays:
from anndata._core.merge import equal
from scipy import sparse
s = sparse.random(100, 100, format="csr", density=0.1)
equal(s, s.toarray()) # False
equal(s.toarray(), s) # True
But I think I would consider this a separate issue. In general, these functions currently assume that the array types being compared are the same (which fits the use case).
A real solution here would need an array type promotion hierarchy (see the sketch below):
- If you are comparing dense to sparse, which method gets used?
- If you use equal, and some elements are dense while some are sparse, what gets returned?
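To illustrate the direction dependence, a hypothetical sketch of dispatching on the first argument's type only, which is one way such asymmetries arise (not necessarily anndata's exact mechanism):

from functools import singledispatch

import numpy as np
from scipy import sparse

@singledispatch
def equal(a, b) -> bool:
    # Default (dense) path: densify b if needed, so equal(dense, sparse) passes
    if sparse.issparse(b):
        b = b.toarray()
    return bool(np.array_equal(np.asarray(a), np.asarray(b)))

@equal.register(sparse.spmatrix)
def _(a, b) -> bool:
    # Sparse path assumes b is sparse too; a dense b is simply rejected,
    # hence equal(sparse, dense) and equal(dense, sparse) can disagree
    if not sparse.issparse(b):
        return False
    return (a != b).nnz == 0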
Co-authored-by: Selman Özleyen <[email protected]>
It seems it wasn't actually doing anything wrong at the time.
I got a little excited. Maybe this will be 0.10.0.
…SparseDataset and append, instead do something like:

TODO: sparse_matrix.transpose(axes=(1, 0)) scipy/scipy#19161

I've now tried making it faster using the approach above. It's faster, but only by about 20% for default zarr settings. Here's that implementation:
Some timing results:
Setup
From profiling, it seems like a lot of the difference is from:
I think a lot of this could be made up with more write buffering.
But also this is way more complex.
Transpose
Transpose is an issue since scipy doesn't let you pass axes to transpose, while dask always passes that argument. I've opened a PR to scipy to fix this, but we'll need to work around it until then, either with a fix on our side or by disallowing transpose here.
I think the fix should look like:
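A minimal sketch of that kind of workaround, reusing the _safe_transpose name from the diff above (the actual implementation in this PR may differ):

import numpy as np
from scipy import sparse

def _safe_transpose(x, axes=None):
    # dask always passes axes; scipy.sparse rejects transpose(axes=...),
    # so validate the argument here and drop it for the sparse call
    if axes is not None and tuple(axes) != (1, 0):
        raise ValueError("Only 2D transposes with axes=(1, 0) are supported")
    if sparse.issparse(x):
        return x.transpose()
    return np.transpose(x, axes=axes)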
Indexing
dask.Array.vindex is broken when the meta array is a sparse matrix. Also maybe when it's a dense np.matrix class, but that is less important.
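For reference, a rough repro sketch of that failure mode (hypothetical; the exact error depends on the dask version):

import dask.array as da
import numpy as np
from scipy import sparse

# A dask array whose chunks (and meta) are scipy sparse matrices
x = da.from_array(
    sparse.random(100, 50, format="csr", density=0.1), chunks=(50, 25)
)
idx = np.ix_(np.array([1, 2, 3]), np.array([0, 1]))
x.vindex[idx].compute()  # raises on affected dask versions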