♻️ Add `.standardize()` to `Curator` and refactor #2186

sunnyosun · 2024-11-18T15:40:33Z

Add `.standardize()` to `Curator`

Before this PR, a synonym was considered a valid value for a record. This is no longer the case.

Example: Assume you have a bt.CellType registry with this content:

name	synonyms
"T cell"	"T-cell"

And now you curate a dataframe against it:

>>> df = pd.DataFrame({"cell_type": ["T-cell", "T cell"]})
>>> curator = ln.Curator.from_df(df, categoricals={"cell_type": bt.CellType.name})
# before this PR, the below returned `True` if `"T-cell"` is a synonym of `"T cell"` in bt.CellType
# now, it returns False
>>> curator.validate()
• mapping "cell_type" on CellType.name
!   1 term is not validated: 'T-cell'
    1 synonym found: "T-cell" → "T cell"
    → curate synonyms via .standardize("cell_type")

You can now call curator.standardize() to standardize synonyms.

>>> curator.standardize("cell_type")  # modifies df inplace
✓ standardized 1 synonym in "cell_type": "T-cell" → "T cell"
>>> # new state of df is pd.DataFrame({"cell_type": ["T cell", "T cell"]})
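
The change in behavior can be sketched in plain Python, without lamindb or pandas. This is only an illustration under stated assumptions, not the library's implementation: `REGISTRY`, `validate`, and `standardize` are hypothetical stand-ins, and a dict of lists stands in for the dataframe.

```python
# Hypothetical stand-in for a registry: canonical name -> set of synonyms.
REGISTRY = {"T cell": {"T-cell"}}


def validate(values):
    """Return True only if every value is a canonical name (synonyms do not count)."""
    return all(value in REGISTRY for value in values)


def standardize(table, key):
    """Replace synonyms in table[key] in place and return the mapping that was applied."""
    synonym_to_name = {syn: name for name, syns in REGISTRY.items() for syn in syns}
    applied = {v: synonym_to_name[v] for v in table[key] if v in synonym_to_name}
    table[key] = [applied.get(v, v) for v in table[key]]
    return applied


table = {"cell_type": ["T-cell", "T cell"]}
before = validate(table["cell_type"])      # a synonym is present
applied = standardize(table, "cell_type")  # mutates the table
after = validate(table["cell_type"])       # only canonical names remain
```

In this sketch, validation fails while the synonym is present and passes once it has been mapped to its canonical name.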

Explain validation concept more clearly in the docs

Before / After (docs screenshots)

Further changes

Refactored Curator to improve code clarity and performance
Added more tests for curators

codecov · 2024-11-18T16:17:35Z

Codecov Report

Attention: Patch coverage is 91.61290% with 13 lines in your changes missing coverage. Please review.

Project coverage is 92.86%. Comparing base (c54f99f) to head (919ac86).
Report is 7 commits behind head on main.

Files with missing lines	Patch %	Lines
lamindb/_curate.py	91.50%	13 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2186      +/-   ##
==========================================
+ Coverage   92.36%   92.86%   +0.49%     
==========================================
  Files          54       54              
  Lines        6566     6660      +94     
==========================================
+ Hits         6065     6185     +120     
+ Misses        501      475      -26

github-actions · 2024-11-20T11:09:28Z

🚀 Deployed on https://67471f2b325c71248f84875f--lamindb-qnwk.netlify.app

falexwolf · 2024-11-26T13:22:37Z

👋 Are you also going to add the BIG BIG box to the docs with the following definitions?

what a validated label & feature identifier is
what a validated record is
what a validated dataset is (one could also call it curated or standardized dataset as these would be all equivalent)

It'd be also super helpful to explain lamindb's validation & curation by contrasting it with pandera, pydantic and other popular validation frameworks.

falexwolf · 2024-11-26T13:34:33Z

And is this a good time to look into?

sunnyosun · 2024-11-27T10:01:14Z

Looking into performance now!

Zethson

Great!

1. We are losing about 1.5% coverage with this PR. That's a bit much.
2. I can't properly comment on the notebooks so I'll make some messy comments here.
   In curate-df:

   In LaminDB, validation means verifying that values exist in metadata registries. Validation criteria define which registry fields should contain these values.

   Technically correct but if I were to read this for the first time, it would sound a bit loose to me. I would feel more comfortable if it were clearer that in many cases these "values that exist in metadata registries" are based on public ontologies.

   I'm not sure how useful distinguishing between "Validated label & feature identifier" and "validated record" is. At least the description didn't give much more away to me.
3. The ux of standardize() is a bit weird to me. I'd consider defaulting the key to all. I'd also consider supporting Iterable. And should it be called key? key is programmers language and maybe we should consider something more meaningful to users.
4. I know that our pipeline tests are easier but they're not best practice. See comments in test section.

I made a commit with some minor changes to not bug you with nitpicks.

lamindb/_from_values.py

Zethson · 2024-11-27T09:32:23Z

tests/core/test_curate.py

@@ -98,17 +97,45 @@ def mock_transform():


 def test_df_curator(df, categoricals):


This is generally about test design, but we should consider splitting tests a bit more. A general mantra is that a single test tests a single behavior. We should not have super tests that are more like pipelines. There are a couple of benefits to this, among them that the test name and a potential one-sentence docstring immediately give away what is being tested.

TLDR: I'd say that with this PR or a followup PR, we should split all of the big curator tests into tiny test functions.
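
A rough sketch of what "one test, one behavior" could look like. The helpers below are hypothetical stand-ins written for this illustration; they are not the code in tests/core/test_curate.py or in lamindb.

```python
# Hypothetical stand-ins for the logic under test; not lamindb code.
SYNONYMS = {"T-cell": "T cell"}
NAMES = {"T cell"}


def find_non_validated(values):
    """Return values that are not canonical names, in first-seen order."""
    return [v for v in dict.fromkeys(values) if v not in NAMES]


def standardize_values(values):
    """Map synonyms to canonical names, leaving everything else untouched."""
    return [SYNONYMS.get(v, v) for v in values]


def test_synonym_is_not_validated():
    """A synonym alone does not pass validation."""
    assert find_non_validated(["T-cell", "T cell"]) == ["T-cell"]


def test_standardize_maps_synonym_to_name():
    """Standardizing replaces a synonym with its canonical name."""
    assert standardize_values(["T-cell"]) == ["T cell"]


def test_standardize_keeps_unknown_values():
    """Values that are neither names nor synonyms are left as they are."""
    assert standardize_values(["my new cell"]) == ["my new cell"]
```

Each function name and docstring states the single behavior being checked, so a failure points directly at what broke.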

Zethson · 2024-11-27T09:36:49Z

tests/core/test_curate.py

    assert validated is False

+    # deprecated method
+    curator.add_new_from_columns()


There is no assert here. It's a bit weird to use it like this and have us programmers guess whether it had the desired effect or not in the subsequent code in this test.

It's a deprecated method that doesn't do anything but needs to be covered. What do you suggest?

I guess a pragma: no cover is good enough for that function assuming it calls the non-deprecated one that is covered.

If only pragma: no cover worked as expected sigh.
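
For reference, a minimal sketch of the two options discussed in this thread: excluding the deprecated method from coverage with a `# pragma: no cover` comment, or giving it a one-assert test. The `Curator` class here is a hypothetical stub, and the assumption that the deprecated method emits a `DeprecationWarning` is mine, not taken from the PR.

```python
import warnings


class Curator:
    """Hypothetical minimal curator; only the deprecation pattern matters here."""

    def __init__(self):
        self.added = []

    def add_new_from(self, key):
        """The non-deprecated method, covered by its own tests."""
        self.added.append(key)

    # Option 1: exclude the deprecated method from coverage measurement.
    def add_new_from_columns(self):  # pragma: no cover
        """Deprecated; kept only for backward compatibility."""
        warnings.warn(
            "add_new_from_columns() is deprecated, use add_new_from()",
            DeprecationWarning,
            stacklevel=2,
        )


# Option 2: a tiny test with an explicit assert instead of a bare call.
def test_add_new_from_columns_warns():
    """Calling the deprecated method emits a DeprecationWarning."""
    curator = Curator()
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        curator.add_new_from_columns()
    assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```

Either option on its own would address the comment; with pytest, the body of the test could also be written with `pytest.warns(DeprecationWarning)`.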

sunnyosun · 2024-11-27T11:02:05Z

We are losing about 1.5% coverage with this PR. That's a bit much.

No, I added a lot more tests and we in fact gained 0.5% coverage. The codecov report up there didn't update; see here: https://app.codecov.io/gh/laminlabs/lamindb/pull/2186?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=laminlabs

sunnyosun · 2024-11-27T11:06:21Z

The ux of standardize() is a bit weird to me. I'd consider defaulting the key to all. I'd also consider supporting Iterable. And should it be called key? key is programmers language and maybe we should consider something more meaningful to users.

Thought about it, but I think sometimes users want to standardize in other ways. It's a mutating operation to their dataset, so I want to be careful.

Calling a for loop for .standardize is super easy; I'd not support that to minimize our maintenance.

key is just a generic word I can come up with that can cover all different formats (obs columns, var.index, mudata stuff), I'd like to be more specific if you have a better name.

falexwolf · 2024-11-27T11:30:30Z

I've been confused about the choice of Treatment for years now. It'd be great to have a paragraph somewhere that explains why that choice was made over something else. Some people also prefer "intervention", but I think that's less common in biology.

Zethson · 2024-11-27T11:37:00Z

Thought about it, but I think sometimes users want to standardize in other ways. It's a mutating operation to their dataset, so I want to be careful.

Calling a for loop for .standardize is super easy; I'd not support that to minimize our maintenance.

Well, all is equally as dangerous and we are supporting that. It is easy to call a loop, but it breaks the UX a bit because when using the curator users usually don't write loops for validate() either. It's very cheap for us to support Iterable and I don't think that it adds a lot of maintenance burden? It should be easy to implement and test.

key is just a generic word I can come up with that can cover all different formats (obs columns, var.index, mudata stuff), I'd like to be more specific if you have a better name.

Yeah, that's a lot...maybe:

values_key?
categorical_key? - for consistency with categoricals but it would not cover var_index...
col? var_index is also a column.
col_key

sunnyosun · 2024-11-27T11:47:08Z

Thought about it, but I think sometimes users want to standardize in other ways. It's a mutating operation to their dataset, so I want to be careful.
Calling a for loop for .standardize is super easy; I'd not support that to minimize our maintenance.

Well, all is equally as dangerous and we are supporting that. It is easy to call a loop, but it breaks the UX a bit because when using the curator users usually don't write loops for validate() either. It's very cheap for us to support Iterable and I don't think that it adds a lot of maintenance burden? It should be easy to implement and test.

key is just a generic word I can come up with that can cover all different formats (obs columns, var.index, mudata stuff), I'd like to be more specific if you have a better name.

Yeah, that's a lot...maybe:

values_key?

categorical_key? - for consistency with categoricals but it would not cover var_index...

col? var_index is also a column.

col_key

But if it's called key or values_key or anything, it implies it's a singular, making it Iterable is very confusing. At least I always find things confusing if they accept a singular and a list at the same time. I think having all is enough to make the UX you described above, an Iterable is overkill and even more confusing to name.

Yes "all" is equally dangerous, that's why I'm not making it a default and we should not advertise using it unless necessary in all our guides.

Anything with col is no good, because it can be "var_index".

I'm not sure if values_key is more clear than key, also it's a bit long.

Zethson · 2024-11-27T12:36:32Z

But if it's called key or values_key or anything, it implies it's a singular, making it Iterable is very confusing. At least I always find things confusing if they accept a singular and a list at the same time. I think having all is enough to make the UX you described above, an Iterable is overkill and even more confusing to name.

Yes "all" is equally dangerous, that's why I'm not making it a default and we should not advertise using it unless necessary in all our guides.

I think behavior that I've seen before is that None means all. In other words, if we didn't have the all special parameter and rather defaulted to None which means all of them, we could more easily enable Iterable.
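
To make the proposal concrete, here is a small sketch of the "None means all" convention combined with accepting either a single key or an iterable of keys. `resolve_keys` is a hypothetical helper invented for this illustration; it is not part of lamindb and does not reflect what was merged.

```python
from typing import Iterable, List, Optional, Union


def resolve_keys(key: Optional[Union[str, Iterable[str]]], available: List[str]) -> List[str]:
    """Normalize `key` to a list of keys; None selects every available key."""
    if key is None:
        return list(available)
    # A str is itself iterable, so it has to be checked before the generic case.
    keys = [key] if isinstance(key, str) else list(key)
    unknown = [k for k in keys if k not in available]
    if unknown:
        raise KeyError(f"unknown keys: {unknown}")
    return keys


available = ["cell_type", "perturbation", "var_index"]
all_keys = resolve_keys(None, available)                         # every key
one_key = resolve_keys("cell_type", available)                   # a single key
some_keys = resolve_keys(["perturbation", "var_index"], available)  # a subset
```

With a helper like this, the special value "all" would no longer be needed, at the cost of making the no-argument call mutate every key.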

falexwolf · 2024-11-27T12:36:49Z

There is a lot of stuff going on here and I'll not read through all code and all comments unless you ask me to.

So, I'll just focus on getting the "what the hell does validation even mean?" question right. The first 5 min in which the user gets confronted with the topic "Curating datasets".

I pushed a commit that effects the following changes. We can iterate from there or revert some of it.

Before / After (docs screenshots)

I noticed that sometimes we say "validation criteria" and sometimes "validation constraint". We should only use one of them. Which?

Shall we combine the example we have on the "curate" docs page with the one we have on the quickstart (below an updated version) and also throw in a datetime column?

import anndata as ad
import bionty as bt
import lamindb as ln
import pandas as pd

df = pd.DataFrame(
    {"CD8A": [1, 2, 3], "CD4": [3, 4, 5], "CD14": [5, 6, 7], "perturbation": ["DMSO", "IFNG", "DMSO"], "cell_type_by_expert": ["B cell", "T cell", "T cell"], "cell_type_by_model": ["B cell", "T cell", "T cell"]},
    index=["sample1", "sample2", "sample3"],
)
adata1 = ad.AnnData(
    df[["CD8A", "CD4", "CD14"]], obs=df[["perturbation", "cell_type_by_expert", "cell_type_by_model"]]
)
curator = ln.Curator.from_anndata(adata1, var_index=bt.Gene.symbol, categoricals={"perturbation": ln.ULabel.name, "cell_type_by_expert": bt.CellType.name,  "cell_type_by_model": bt.CellType.name}, organism="human")
curator.add_new_from("perturbation")
artifact1 = curator.save_artifact(key="example_datasets/dataset1.h5ad")

And, shall we first study the case where everything passes so that it doesn't look so intimidating? I mean, people will use the logging to resolve issues and not read the docs for this anyway I imagine.

We have to be careful with what we put here vs. what we put on the quickstart. But that can be another iteration.

sunnyosun · 2024-11-27T13:15:06Z

But if it's called key or values_key or anything, it implies it's a singular, making it Iterable is very confusing. At least I always find things confusing if they accept a singular and a list at the same time. I think having all is enough to make the UX you described above, an Iterable is overkill and even more confusing to name.
Yes "all" is equally dangerous, that's why I'm not making it a default and we should not advertise using it unless necessary in all our guides.

I think behavior that I've seen before is that None means all. In other words, if we didn't have the all special parameter and rather defaulted to None which means all of them, we could more easily enable Iterable.

If all is the default, everyone will start calling it without passing anything and without thinking. I want users to not do that, but instead actually curate what they think should be standardized. key is not needed at all if we introduce all as default.

Again, I think users should not do things blindly when it comes to add_new_from and standardize.

Zethson · 2024-11-27T13:59:38Z

Again, I think users should not do things blindly when it comes to add_new_from and standardize.

I agree with add_new_from 100%. But standardize() should be fine. Yes, it's inplace but it only matches values that are in the instance or the public ontologies. I don't see the big risks here.

We can gather some feedback from our users first and then still act later if necessary.

sunnyosun · 2024-11-27T14:03:30Z

Again, I think users should not do things blindly when it comes to add_new_from and standardize.

I agree with add_new_from 100%. But standardize() should be fine. Yes, it's inplace but it only matches values that are in the instance or the public ontologies. I don't see the big risks here.

We can gather some feedback from our users first and then still act later if necessary.

For example, Felix's issue with some numbers in synonyms will be a disaster if applied blindly. So I think users still need to be careful with standardize.

falexwolf · 2024-11-27T18:47:08Z

Can this PR get a much more comprehensive description with examples and screenshots etc.?

This seems cryptic

And it seems to miss the biggest change that .standardize() is actually called by default and might mutate the object that was passed.

It's very important to have clear logging in case the object is mutated.

Can you please add an example of a small dataframe that is mutated via synonym mapping and then issues this logging message?

I hope you can just use one that's in the tests.
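
To illustrate the kind of logging I mean, here is a minimal sketch in plain Python. It is not lamindb code: `standardize_column`, the logger name, and the message format are made up for this example, and a dict of lists stands in for the dataframe.

```python
import logging

logger = logging.getLogger("curation_sketch")  # hypothetical logger name


def standardize_column(table, key, synonyms):
    """Replace synonyms in table[key] in place; log only if something changed."""
    original = table[key]
    mapped = [synonyms.get(value, value) for value in original]
    changes = {old: new for old, new in zip(original, mapped) if old != new}
    if changes:
        table[key] = mapped  # the passed object is mutated here
        pairs = ", ".join(f'"{old}" → "{new}"' for old, new in changes.items())
        logger.warning('standardized %d synonym(s) in "%s": %s', len(changes), key, pairs)
    return len(changes)


table = {"cell_type": ["T-cell", "T cell"]}
n_changed = standardize_column(table, "cell_type", {"T-cell": "T cell"})
```

The point is that a message is emitted exactly when the object was mutated, and that it names the key and the values that changed.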

falexwolf · 2024-11-28T08:50:35Z

Ping @sunnyosun! :D

🗑️ Deprecate add_new_from_columns

b0cd035

sunnyosun changed the title ~~🗑️ Deprecate add_new_from_columns~~ ♻️ Refactor Curator Nov 18, 2024

sunnyosun added 2 commits November 18, 2024 16:56

💚 Fix

410fb68

💚 Fix

f09dcaa

sunnyosun added 2 commits November 19, 2024 19:06

✨ Standardize_categories

9f378a8

Merge branch 'main' into refactor-curator

fdda9d9

github-actions bot temporarily deployed to pull request November 20, 2024 11:09 Inactive

sunnyosun added 2 commits November 25, 2024 15:58

Merge branch 'main' into refactor-curator

0d7d677

🎨 Temp

f1469a9

github-actions bot temporarily deployed to pull request November 25, 2024 15:12 Inactive

sunnyosun added 2 commits November 25, 2024 17:29

🎨 Better logging with standardize

9b73bad

🎨 Add standardize to df and adata

a676a36

sunnyosun changed the title ~~♻️ Refactor Curator~~ ♻️ Add .standardize() to Curator Nov 26, 2024

Merge branch 'main' into refactor-curator

61e805c

💚 Default print_values to '

fcdc1f7

github-actions bot temporarily deployed to pull request November 26, 2024 14:05 Inactive

sunnyosun added 8 commits November 26, 2024 16:17

🧪 More tests

b58ae11

💚 Fix tests

a38d998

💚 Fix tests

43a7833

💚 Fix tests

0f93c13

♻️ Mudata

33d1a3a

✏️ Mudata

5d97ba3

💚 Fix

0287218

💚 Fix

a8dad5b

sunnyosun mentioned this pull request Nov 26, 2024

Remove self.kwargs leftover argument #2198

Closed

💚 Fix all tests

948c56c

⚡️ Speed up

b9c2895

sunnyosun mentioned this pull request Nov 27, 2024

⚡ Curator.validate() is too slow #2217

Open

github-actions bot temporarily deployed to pull request November 27, 2024 10:37 Inactive

Zethson added 2 commits November 27, 2024 11:57

🎨 Polish

4227428

Signed-off-by: zethson <[email protected]>

Merge branch 'refactor-curator' of https://github.com/laminlabs/lamindb…

4b23f51

… into refactor-curator

Zethson requested changes Nov 27, 2024

🎨 Submodules

d93c0b4

Signed-off-by: zethson <[email protected]>

sunnyosun added 2 commits November 27, 2024 12:42

🎨 Address comments

33ba7e8

✏️ Fix return type

27df2e1

💚 Fix

e5771f5

📝 Iterate on docs

48d1e42

✏️ Fix

919ac86

github-actions bot temporarily deployed to pull request November 27, 2024 13:31 Inactive

sunnyosun merged commit 52492e7 into main Nov 27, 2024
15 of 16 checks passed

sunnyosun deleted the refactor-curator branch November 27, 2024 13:57
