Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

♻️ Add .standardize() to Curator and refactor #2186

Merged
merged 35 commits into from
Nov 27, 2024
Merged

Conversation

sunnyosun
Copy link
Member

@sunnyosun sunnyosun commented Nov 18, 2024

Add .standardize() to Curator

Before this PR, a a synonym was considered a valid value for a record. This is no longer the case.

Example: Assume you have a bt.CellType registry with this content:

name synonyms
"T cell" "T-cell"

And now you curate a dataframe against it:

>> df = pd.DataFrame({"cell_type": ["T-cell", "T cell"]})
>> curator = ln.Curator.from_df(df, categoricals={"cell_type": bt.CellType.name}
# before this PR, the below returned `True` if `"T-cell"` is a synonym of `"T cell"` in bt.CellType
# now, it returns False
>>> curator.validate()
• mapping "cell_type" on CellType.name
!   1 term is not validated: 'T-cell'
    1 synonym found: "T-cell""T cell"curate synonyms via .standardize("cell_type")

You can now call curator.standardize() to standardize synonyms.

>>> curator.standardize("cell_type")  # modifies df inplace
>>> # new state of df is pd.DataFrame({"cell_type": ["T cell", "T cell"]})standardized 1 synonym in "cell_type": "T-cell""T cell"

Explain validation concept more clearly in the docs

Before After
image image
/ image

Further changes

  • Refactored Curator to improve code clarity and performance
  • Added more tests for curators

@sunnyosun sunnyosun changed the title 🗑️ Deprecate add_new_from_columns ♻️ Refactor Curator Nov 18, 2024
Copy link

codecov bot commented Nov 18, 2024

Codecov Report

Attention: Patch coverage is 91.61290% with 13 lines in your changes missing coverage. Please review.

Project coverage is 92.86%. Comparing base (c54f99f) to head (919ac86).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
lamindb/_curate.py 91.50% 13 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2186      +/-   ##
==========================================
+ Coverage   92.36%   92.86%   +0.49%     
==========================================
  Files          54       54              
  Lines        6566     6660      +94     
==========================================
+ Hits         6065     6185     +120     
+ Misses        501      475      -26     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

github-actions bot commented Nov 20, 2024

@github-actions github-actions bot temporarily deployed to pull request November 20, 2024 11:09 Inactive
@github-actions github-actions bot temporarily deployed to pull request November 25, 2024 15:12 Inactive
@sunnyosun sunnyosun changed the title ♻️ Refactor Curator ♻️ Add .standardize() to Curator Nov 26, 2024
@falexwolf
Copy link
Member

👋 Are you also going to add the BIG BIG box to the docs with the following definitions?

  1. what a validated label & feature identifier is
  2. what a validated record is
  3. what a validated dataset is (one could also call it curated or standardized dataset as these would be all equivalent)

It'd be also super helpful to explain lamindb's validation & curation by contrasting it with pandera, pydantic and other popular validation frameworks.

@falexwolf
Copy link
Member

@github-actions github-actions bot temporarily deployed to pull request November 26, 2024 14:05 Inactive
@sunnyosun
Copy link
Member Author

Looking into performance now!

@github-actions github-actions bot temporarily deployed to pull request November 27, 2024 10:37 Inactive
Copy link
Member

@Zethson Zethson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

  1. We are losing about 1.5% coverage with this PR. That's a bit much.
  2. I can't properly comment on the notebooks so I'll make some messy comments here.
    In curate-df:

In LaminDB, validation means verifying that values exist in metadata registries. Validation criteria define which registry fields should contain these values.

Technically correct but if I were to read this for the first time, it would sound a bit lose to me. I would feel more comfortable if it were clearer that in many cases these "values that exist in metadata registries" are based on public ontologies.

I'm not sure how useful distinguishing between "Validated label & feature identifier" and "validated record" is. At least the description didn't give much more away to me.
3. The ux of standardize() is a bit weird to me. I'd consider defaulting the key to all. I'd also consider supporting Iterable. And should it be called key? key is programmers language and maybe we should consider something more meaningful to users.
4. I know that our pipeline tests are easier but they're not best practice. See comments in test section.

I made a commit with some minor changes to not bug you with nitpicks.

lamindb/_from_values.py Outdated Show resolved Hide resolved
@@ -98,17 +97,45 @@ def mock_transform():


def test_df_curator(df, categoricals):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is generally about test design but we should consider splitting tests a bit more. A general mantra is that a single test tests for a single behavior. We should not have such super tests that are more like pipelines. There's a couple of benefits for this but among them is that the test name and a potential 1 sentence docstring immediately give away what is being tested here.

TLDR: I'd say that with this PR or a followup PR, we should split all of the big curator tests into tiny test functions.

tests/core/test_curate.py Outdated Show resolved Hide resolved
assert validated is False

# deprecated method
curator.add_new_from_columns()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no assert here. It's a bit weird to use it like this and have us programmers guess whether it had the desired effect or not in the subsequent code in this test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a deprecated method doesn't do anything but needs to be covered, what do you suggest to do?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess a pragma: no cover is good enough for that function assuming it calls the non-deprecated one that is covered.

If only pragma: no cover worked as expected sigh.

tests/core/test_curate.py Outdated Show resolved Hide resolved
lamindb/_curate.py Show resolved Hide resolved
lamindb/_curate.py Show resolved Hide resolved
lamindb/_curate.py Show resolved Hide resolved
lamindb/_curate.py Outdated Show resolved Hide resolved
lamindb/_curate.py Outdated Show resolved Hide resolved
@sunnyosun
Copy link
Member Author

sunnyosun commented Nov 27, 2024

We are losing about 1.5% coverage with this PR. That's a bit much.

No, I added a lot more tests, we in fact gained 0.5% coverage, the codecov report up there didn't update, see here: https://app.codecov.io/gh/laminlabs/lamindb/pull/2186?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=laminlabs

Signed-off-by: zethson <[email protected]>
@sunnyosun
Copy link
Member Author

  1. The ux of standardize() is a bit weird to me. I'd consider defaulting the key to all. I'd also consider supporting Iterable. And should it be called key? key is programmers language and maybe we should consider something more meaningful to users.

Thought about it, but I think sometimes user wants to standardize in other ways. It's a mutating operation to their dataset, so I want to be careful.

Calling a for loop for .standardize is super easy; I'd not support that to minimize our maintenance.

key is just a generic word I can come up with that can cover all different formats (obs columns, var.index, mudata stuff), I'd like to be more specific if you have a better name.

@sunnyosun

This comment was marked as outdated.

@Zethson

This comment was marked as outdated.

@falexwolf
Copy link
Member

falexwolf commented Nov 27, 2024

I've been confused about the choice Treatment for years meanwhile. It'd be great to have a paragraph somewhere that explains why that choice versus something else. Some people also prefer "intervention" but I think that's less common in biology.

@Zethson
Copy link
Member

Zethson commented Nov 27, 2024

Thought about it, but I think sometimes user wants to standardize in other ways. It's a mutating operation to their dataset, so I want to be careful.

Calling a for loop for .standardize is super easy; I'd not support that to minimize our maintenance.

Well, all is equally as dangerous and we are supporting that. It is easy to call a loop, but it breaks the UX a bit because when using the curator users usually don't write loops for validate() either. It's very cheap for us to support Iterable and I don't think that it adds a lot of maintenance burden? It should be easy to implement and test.

key is just a generic word I can come up with that can cover all different formats (obs columns, var.index, mudata stuff), I'd like to be more specific if you have a better name.

Yeah, that's a lot...maybe:

  1. values_key?
  2. categorical_key? - for consistency with categoricals but it would not cover var_index...
  3. col? var_index is also a column.
  4. col_key

@sunnyosun
Copy link
Member Author

sunnyosun commented Nov 27, 2024

Thought about it, but I think sometimes user wants to standardize in other ways. It's a mutating operation to their dataset, so I want to be careful.
Calling a for loop for .standardize is super easy; I'd not support that to minimize our maintenance.

Well, all is equally as dangerous and we are supporting that. It is easy to call a loop, but it breaks the UX a bit because when using the curator users usually don't write loops for validate() either. It's very cheap for us to support Iterable and I don't think that it adds a lot of maintenance burden? It should be easy to implement and test.

key is just a generic word I can come up with that can cover all different formats (obs columns, var.index, mudata stuff), I'd like to be more specific if you have a better name.

Yeah, that's a lot...maybe:

  1. values_key?
  2. categorical_key? - for consistency with categoricals but it would not cover var_index...
  3. col? var_index is also a column.
  4. col_key

But if it's called key or values_key or anything, it implies it's a singular, making it Iterable is very confusing. At least I always find thing confusing if they accept a sigular and a list at the same time. I think having all is enough to make the UX you described above, an Iterable is overkill and even more confusing to name.

Yes "all" is equally dangerous, that's why I'm not making it a default and we should not advertise using it unless necessary in all our guides.

Anything with col is no good, because it can be "var_index".

I'm not sure if values_key is more clear than key, also it's a bit long.

@Zethson
Copy link
Member

Zethson commented Nov 27, 2024

But if it's called key or values_key or anything, it implies it's a singular, making it Iterable is very confusing. At least I always find thing confusing if they accept a sigular and a list at the same time. I think having all is enough to make the UX you described above, an Iterable is overkill and even more confusing to name.

Yes "all" is equally dangerous, that's why I'm not making it a default and we should not advertise using it unless necessary in all our guides.

I think behavior that I've seen before is that None means all. In other words, if we didn't have the all special parameter and rather defaulted to None which means all of them, we could more easily enable Iterable.

@falexwolf
Copy link
Member

falexwolf commented Nov 27, 2024

There is a lot of stuff going on here and I'll not read through all code and all comments unless you ask me to.

So, I'll just focus on getting the "what the hell does validation even mean?" question right. The first 5 min in which the user gets confronted with the topic "Curating datasets".

I pushed a commit that effects the following changes. We can iterate from there or revert some of it.

Before After
image image
image image

I noticed that sometimes we say "validation criteria" and sometimes "validation constraint". We should only use one of them. Which?

Shall we combine the example we have on the "curate" docs page with the one we have on the quickstart (below an updated version) and also throw in a datetime column?

df = pd.DataFrame(
    {"CD8A": [1, 2, 3], "CD4": [3, 4, 5], "CD14": [5, 6, 7], "perturbation": ["DMSO", "IFNG", "DMSO"], "cell_type_by_expert": ["B cell", "T cell", "T cell"], "cell_type_by_model": ["B cell", "T cell", "T cell"]},
    index=["sample1", "sample2", "sample3"],
)
adata1 = ad.AnnData(
    df[["CD8A", "CD4", "CD14"]], obs=df[["perturbation", "cell_type_by_expert", "cell_type_by_model"]]
)
curator = ln.Curator.from_anndata(adata1, var_index=bt.Gene.symbol, categoricals={"perturbation": ln.ULabel.name, "cell_type_by_expert": bt.CellType.name,  "cell_type_by_model": bt.CellType.name}, organism="human")
curator.add_new_from("perturbation")
artifact1 = curator.save_artifact(key="example_datasets/dataset1.h5ad")

And, shall we first study the case where everything passes so that it doesn't look so intimidating? I mean, people will use the logging to resolve issues and not read the docs for this anyway I imagine.

We have to be careful with what we put here vs. what we put on the quickstart. But that that can be another iteration.

@sunnyosun
Copy link
Member Author

But if it's called key or values_key or anything, it implies it's a singular, making it Iterable is very confusing. At least I always find thing confusing if they accept a sigular and a list at the same time. I think having all is enough to make the UX you described above, an Iterable is overkill and even more confusing to name.
Yes "all" is equally dangerous, that's why I'm not making it a default and we should not advertise using it unless necessary in all our guides.

I think behavior that I've seen before is that None means all. In other words, if we didn't have the all special parameter and rather defaulted to None which means all of them, we could more easily enable Iterable.

If all is the default, everyone will start calling it without passing anything and without thinking. I want users to not do that, but instead actually curate what they think should be standardized. key is not needed at all if we introduce all as default.

Again, I think users should not do things blindly when it comes to add_new_from and standardize.

@github-actions github-actions bot temporarily deployed to pull request November 27, 2024 13:31 Inactive
@sunnyosun sunnyosun merged commit 52492e7 into main Nov 27, 2024
15 of 16 checks passed
@sunnyosun sunnyosun deleted the refactor-curator branch November 27, 2024 13:57
@Zethson
Copy link
Member

Zethson commented Nov 27, 2024

Again, I think users should not do things blindly when it comes to add_new_from and standardize.

I agree with add_new_from 100%. But standardize() should be fine. Yes, it's inplace but it only matches values that are in the instance or the public ontologies. I don't see the big risks here.

We can gather some feedback from our users first and then still act later if necessary.

@sunnyosun
Copy link
Member Author

Again, I think users should not do things blindly when it comes to add_new_from and standardize.

I agree with add_new_from 100%. But standardize() should be fine. Yes, it's inplace but it only matches values that are in the instance or the public ontologies. I don't see the big risks here.

We can gather some feedback from our users first and then still act later if necessary.

For example, Felix's issue with some numbers in synonyms will be a disaster if applied blindly. So I think users still need to be careful with standardize.

@falexwolf
Copy link
Member

Can this PR get a much more comprehensive description with examples and screenshots etc.?

This seems cryptic
image

And it seems to miss the biggest change that .standardize() is actually called by default and might mutate the object that was passed.

It's very important to have clear logging in case the object is mutated.

Can you please add an example for a small dataframe that is mutated via of synonyms-mapping and then issues this logging message?

I hope you can just use one that's in the tests.

@falexwolf
Copy link
Member

Ping @sunnyosun! :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants