
Missing values count should be performed in unprocessed dataset #57

Open
NickNtamp opened this issue Dec 8, 2022 · 5 comments
Labels: bug (Something isn't working), good first issue (Good for newcomers)

Comments

@NickNtamp
Contributor

Since the missing values count is an indicator of missing values in a feature of a dataset, it is much more valuable when performed on the unprocessed dataset rather than on the processed one: most likely, a missing-value handling procedure has already been applied to the processed dataset.

Also, for monitoring and alerting purposes, it is more useful to report missing values as a percentage of the total entries of each feature (e.g. if a feature has 100 entries and 3 missing values, the missing values percentage is 3%). This makes setting thresholds for this metric much easier.
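For illustration, a minimal pandas sketch of the percentage idea (the dataframe and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical raw (unprocessed) inference data with some missing entries
df = pd.DataFrame({
    "age": [34, None, 51, None, 27],
    "income": [40000, 52000, 61000, 38000, 45000],
})

# Missing values per feature, as a percentage of the total entries
missing_pct = df.isna().mean() * 100
print(missing_pct)  # age: 40.0, income: 0.0
```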

@momegas
Member

momegas commented Dec 14, 2022

Let's tackle this if we have time this sprint.

@stavrostheocharis
Contributor

Let's do a quick analysis before the implementation.

@stavrostheocharis added the discussion needed (This issue needs some discussion to move forward) and needs analysis (This issue needs analysis) labels on Jan 2, 2023
@NickNtamp
Contributor Author

@stavrostheocharis and @sinnec, here is the further description you asked for.

In `src/cron_tasks/monitoring_metrics.py` you can find the following function:

```python
async def run_calculate_feature_metrics_pipeline(
    model: Model, inference_processed_df: pd.DataFrame
):
    """
    Run the pipeline to calculate the feature metrics.
    After the metrics are calculated they are saved in the database.
    """
    logger.info(f"Calculating feature metrics for model {model.id}")
    feature_metrics_report = create_feature_metrics_pipeline(inference_processed_df)

    if feature_metrics_report:
        new_feature_metric = ModelIntegrityMetricCreate(
            model_id=model.id,
            timestamp=str(datetime.utcnow()),
            feature_metrics=feature_metrics_report,
        )

        crud.model_integrity_metrics.create(db, obj_in=new_feature_metric)
        logger.info("Feature metrics calculated!")
```

As you can see, we calculate the feature metrics (e.g. missing_value_count, average, min, max, etc.) on the processed inference data. I believe that, specifically for the missing values count, we have to perform the calculation on the unprocessed dataset. What do you think?
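One possible shape for this (only a rough sketch of the function above; `inference_unprocessed_df` and `missing_values_report` are hypothetical names, not existing code):

```python
async def run_calculate_feature_metrics_pipeline(
    model: Model,
    inference_processed_df: pd.DataFrame,
    inference_unprocessed_df: pd.DataFrame,  # hypothetical new parameter
):
    # Most metrics (average, min, max, ...) are still computed on the
    # processed data, as before
    feature_metrics_report = create_feature_metrics_pipeline(inference_processed_df)

    # The missing values count is taken from the raw data, before any
    # imputation has run
    missing_values_report = {
        column: int(count)
        for column, count in inference_unprocessed_df.isna().sum().items()
    }
    # ... how to save both reports is the open question
```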

@stavrostheocharis
Contributor

Yes, it should definitely be performed on the unprocessed dataset. The only open questions are where to do it and how to save the results into the database.

More specifically, suppose that the function handling the calculations computes some metrics on the processed dataset and others on the unprocessed one. We then have to define a way for these two sets of metrics to be viewed separately, because the current implementation saves them all together in the database.

So an adjustment to the schema may also be needed.
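For example, the schema could be split along these lines (purely a sketch, assuming a Pydantic schema as is typical with FastAPI; the field names are hypothetical, not the current implementation):

```python
from pydantic import BaseModel

class ModelIntegrityMetricCreate(BaseModel):
    model_id: str
    timestamp: str
    # Metrics computed on the processed inference data (average, min, max, ...)
    processed_feature_metrics: dict
    # Metrics computed on the unprocessed data (e.g. missing values count)
    unprocessed_feature_metrics: dict
```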

@NickNtamp
Contributor Author

@stavrostheocharis I suggest we discuss it all together on Monday. What do you think?

@momegas added the bug (Something isn't working) label on Feb 15, 2023
@momegas added the good first issue (Good for newcomers) label and removed the discussion needed (This issue needs some discussion to move forward) and needs analysis (This issue needs analysis) labels on Feb 15, 2023
@momegas moved this from Backlog to Planned in 🐻‍❄️ Whitebox - Issue tracking on Feb 17, 2023