♻️ Add `.standardize()` to `Curator` and refactor #2186
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2186 +/- ##
==========================================
+ Coverage 92.36% 92.86% +0.49%
==========================================
Files 54 54
Lines 6566 6660 +94
==========================================
+ Hits 6065 6185 +120
+ Misses 501 475 -26
👋 Are you also going to add the BIG BIG box to the docs with the following definitions?
It'd also be super helpful to explain lamindb's validation & curation by contrasting it with pandera, pydantic and other popular validation frameworks.
And is this a good time to look into?
Looking into performance now!
Signed-off-by: zethson <[email protected]>
… into refactor-curator
Great!
- We are losing about 1.5% coverage with this PR. That's a bit much.
- I can't properly comment on the notebooks so I'll make some messy comments here.
In curate-df:
"In LaminDB, validation means verifying that values exist in metadata registries. Validation criteria define which registry fields should contain these values."
1. Technically correct, but if I were to read this for the first time, it would sound a bit loose to me. I would feel more comfortable if it were clearer that in many cases these "values that exist in metadata registries" are based on public ontologies.
2. I'm not sure how useful distinguishing between "Validated label & feature identifier" and "validated record" is. At least the description didn't give much more away to me.
3. The UX of `standardize()` is a bit weird to me. I'd consider defaulting the `key` to `all`. I'd also consider supporting `Iterable`. And should it be called `key`? `key` is programmers' language and maybe we should consider something more meaningful to users.
4. I know that our pipeline tests are easier but they're not best practice. See comments in test section.
I made a commit with some minor changes to not bug you with nitpicks.
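To make point 3 concrete, here is a minimal sketch of how a `key` argument could accept a single name, an iterable of names, or `"all"`. The function `resolve_keys` and the column names in `CATEGORICAL_KEYS` are hypothetical and only illustrate the idea; this is not lamindb's implementation, and it takes no position on whether `"all"` should be the default.

```python
from __future__ import annotations

from collections.abc import Iterable

# Hypothetical stand-in for the curator's categorical columns; not lamindb's API.
CATEGORICAL_KEYS = ("cell_type", "tissue", "disease")


def resolve_keys(key: str | Iterable[str]) -> list[str]:
    """Normalize the `key` argument into an explicit list of column names."""
    if isinstance(key, str):
        # A plain string is either the special value "all" or one column name.
        keys = list(CATEGORICAL_KEYS) if key == "all" else [key]
    else:
        keys = list(key)
    unknown = [k for k in keys if k not in CATEGORICAL_KEYS]
    if unknown:
        raise KeyError(f"unknown keys: {unknown}")
    return keys
```

With this shape, `resolve_keys("tissue")`, `resolve_keys(["tissue", "disease"])` and `resolve_keys("all")` all go through the same code path, so the mutating step itself only ever sees an explicit list of columns.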
@@ -98,17 +97,45 @@ def mock_transform():


def test_df_curator(df, categoricals):
This is generally about test design, but we should consider splitting tests a bit more. A general mantra is that a single test tests a single behavior. We should not have super tests that are more like pipelines. There are a couple of benefits to this, among them that the test name and a potential one-sentence docstring immediately give away what is being tested.
TL;DR: I'd say that with this PR or a follow-up PR, we should split all of the big curator tests into tiny test functions.
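As an illustration of "one test, one behavior", here is a small sketch. The `validate` function and `KNOWN_CELL_TYPES` are hypothetical toy stand-ins for a curator, not lamindb code; the point is only the shape of the tests.

```python
# Hypothetical toy validator standing in for a curator; not lamindb's API.
KNOWN_CELL_TYPES = {"T cell", "B cell"}


def validate(values):
    """Return True only if every value is a known cell type."""
    return all(v in KNOWN_CELL_TYPES for v in values)


def test_validate_accepts_known_values():
    """Known values pass validation."""
    assert validate(["T cell", "B cell"]) is True


def test_validate_rejects_unknown_value():
    """A single unknown value fails validation."""
    assert validate(["T cell", "my new cell type"]) is False


def test_validate_accepts_empty_input():
    """An empty column is trivially valid."""
    assert validate([]) is True
```

If one of these fails, the test name alone says which behavior broke, which a single pipeline-style test cannot do.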
assert validated is False

# deprecated method
curator.add_new_from_columns()
There is no assert here. It's a bit weird to use it like this and have us programmers guess whether it had the desired effect or not in the subsequent code in this test.
It's a deprecated method that doesn't do anything but needs to be covered. What do you suggest doing?
I guess a `pragma: no cover` is good enough for that function, assuming it calls the non-deprecated one that is covered.
If only `pragma: no cover` worked as expected, sigh.
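A sketch of both options discussed here, under the assumption that the deprecated method only warns and delegates. The function bodies and the name `add_new_from` are hypothetical placeholders, not the actual lamindb code. Option one excludes the alias from coverage with coverage.py's `# pragma: no cover`; option two covers it with a test that has an explicit assertion.

```python
import warnings


def add_new_from(key):
    """Hypothetical non-deprecated method; covered by its own tests."""
    return f"added new values for {key}"


def add_new_from_columns():  # pragma: no cover
    """Deprecated alias that only warns and delegates."""
    warnings.warn(
        "add_new_from_columns() is deprecated, use add_new_from('columns')",
        DeprecationWarning,
        stacklevel=2,
    )
    return add_new_from("columns")


def test_deprecated_alias_warns_and_delegates():
    """The alias emits a DeprecationWarning and returns the delegate's result."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = add_new_from_columns()
    assert any(issubclass(w.category, DeprecationWarning) for w in caught)
    assert result == add_new_from("columns")
```

Only one of the two is needed: with the pragma on the `def` line the whole function is excluded from the report, and with the test the call has a stated expectation instead of a bare invocation.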
No, I added a lot more tests; we in fact gained 0.5% coverage. The Codecov report up there didn't update, see here: https://app.codecov.io/gh/laminlabs/lamindb/pull/2186?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=laminlabs
Signed-off-by: zethson <[email protected]>
Thought about it, but I think sometimes users want to standardize in other ways. It's a mutating operation on their dataset, so I want to be careful. Calling a for loop for
I've been confused about the choice
Well,
Yeah, that's a lot...maybe:
But if it's called Yes "all" is equally dangerous, that's why I'm not making it a default and we should not advertise using it unless necessary in all our guides. Anything with I'm not sure if
I think behavior that I've seen before is that
If Again, I think users should not do things blindly when it comes to
I agree with We can gather some feedback from our users first and then still act later if necessary.
For example, Felix's issue with some numbers in synonyms will be a disaster if applied blindly. So I think users still need to be careful with
Can this PR get a much more comprehensive description with examples, screenshots, etc.? And it seems to miss the biggest change that It's very important to have clear logging in case the object is mutated. Can you please add an example of a small dataframe that is mutated via synonym mapping and then issues this logging message? I hope you can just use one that's in the tests.
Ping @sunnyosun! :D
Add `.standardize()` to `Curator`
Before this PR, a synonym was considered a valid value for a record. This is no longer the case.
Example: Assume you have a `bt.CellType` registry with this content:
And now you curate a dataframe against it:
You can now call `curator.standardize()` to standardize synonyms.
Explain validation concept more clearly in the docs
Further changes
Refactor `Curator` to improve code clarity and performance
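The behavior change described above (a synonym no longer validates; standardizing maps it to the canonical name and logs the mutation) can be sketched without lamindb as follows. The registry content, the logger name and the functions `validate` and `standardize` are all hypothetical and exist only for illustration; they are not the `bt.CellType` registry or the `Curator` API.

```python
import logging

logger = logging.getLogger("curation_sketch")

# Hypothetical registry: canonical name -> synonyms. Not real bt.CellType content.
REGISTRY = {
    "T cell": {"T lymphocyte", "T-cell"},
    "B cell": {"B lymphocyte"},
}
SYNONYM_TO_NAME = {syn: name for name, syns in REGISTRY.items() for syn in syns}


def validate(values):
    """Return the values that are not canonical names (synonyms do not count)."""
    return [v for v in values if v not in REGISTRY]


def standardize(values):
    """Return a copy with synonyms mapped to canonical names, logging each change."""
    result = []
    for v in values:
        name = SYNONYM_TO_NAME.get(v, v)
        if name != v:
            # Make the mutation visible to the user.
            logger.warning("standardized %r -> %r", v, name)
        result.append(name)
    return result


column = ["T cell", "T lymphocyte", "astrocyte-like"]
not_validated = validate(column)      # the synonym and the unknown value
standardized = standardize(column)    # the synonym is replaced, the unknown value is kept
still_not_validated = validate(standardized)
```

After standardizing, only the genuinely unknown value remains unvalidated, and every replaced value has produced one log message.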