Missing values count should be performed in unprocessed dataset #57

NickNtamp · 2022-12-08T11:08:44Z

Since the missing values count indicates how many values are missing in a feature of a dataset, it is much more valuable to compute it on the unprocessed dataset than on the processed one: by the time the data has been processed, a missing-value handling step has most likely already been applied.

Also, for monitoring and alerting purposes, it is more useful to report the missing values as a percentage of the total entries of each feature (e.g. if a feature has 100 entries and 3 missing values, the missing values metric is 3%). This makes it much easier to set thresholds for this metric.
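To illustrate the percentage idea, here is a minimal sketch using pandas. The function names `missing_values_percentage` and `features_over_threshold` are hypothetical and not part of the existing codebase; the threshold value is only an example.

```python
from typing import Dict, List

import pandas as pd


def missing_values_percentage(df: pd.DataFrame) -> Dict[str, float]:
    """Return the share of missing entries per feature, in percent."""
    # isna() gives a boolean mask; the mean of each boolean column is the
    # fraction of missing entries in that feature.
    return (df.isna().mean() * 100).round(2).to_dict()


def features_over_threshold(
    percentages: Dict[str, float], threshold_pct: float
) -> List[str]:
    """Return the features whose missing percentage exceeds the threshold."""
    return [name for name, pct in percentages.items() if pct > threshold_pct]


if __name__ == "__main__":
    # 100 entries per feature, 3 of them missing in "age".
    df = pd.DataFrame({"age": [None] * 3 + [30.0] * 97, "income": [1.0] * 100})
    percentages = missing_values_percentage(df)
    print(percentages)
    print(features_over_threshold(percentages, threshold_pct=2.0))
```

Because the value is relative to the number of entries, the same threshold can be reused for datasets of different sizes.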

momegas · 2022-12-14T09:32:59Z

Let's tackle this if we have time this sprint.

stavrostheocharis · 2023-01-02T12:16:06Z

Let's do a quick analysis before the implementation.

NickNtamp · 2023-01-03T08:46:37Z

@stavrostheocharis and @sinnec, here is the further description you asked for.

In the src->cron_tasks->monitoring_metrics.py you can find the following function:

`async def run_calculate_feature_metrics_pipeline(
    model: Model, inference_processed_df: pd.DataFrame
):
    """
    Run the pipeline to calculate the feature metrics
    After the metrics are calculated they are saved in the database
    """

    logger.info(f"Calculating feature metrics for model {model.id}")
    feature_metrics_report = create_feature_metrics_pipeline(inference_processed_df)

    if feature_metrics_report:
        new_feature_metric = ModelIntegrityMetricCreate(
            model_id=model.id,
            timestamp=str(datetime.utcnow()),
            feature_metrics=feature_metrics_report,
        )

        crud.model_integrity_metrics.create(db, obj_in=new_feature_metric)
        logger.info("Feature metrics calculated!")`

As you can see, we calculate the feature metrics (e.g. missing_value_count, average, min, max etc.) on the processed inference data. I believe that the missing values count specifically has to be calculated on the unprocessed dataset. What do you think?
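A small sketch of why the choice of dataset matters, assuming the processing step imputes missing values. The helper `count_missing_values` and the example columns are hypothetical and only serve as an illustration; they are not taken from the repository.

```python
from typing import Dict

import pandas as pd


def count_missing_values(df: pd.DataFrame) -> Dict[str, int]:
    """Count the missing entries of every feature in the given DataFrame."""
    return {column: int(count) for column, count in df.isna().sum().items()}


if __name__ == "__main__":
    # Unprocessed inference data with gaps in both features.
    unprocessed_df = pd.DataFrame(
        {"age": [25.0, None, 40.0], "city": ["Athens", None, None]}
    )

    # Assumed processing step: impute numeric gaps with the mean and
    # categorical gaps with a placeholder value.
    processed_df = unprocessed_df.copy()
    processed_df["age"] = processed_df["age"].fillna(processed_df["age"].mean())
    processed_df["city"] = processed_df["city"].fillna("unknown")

    print(count_missing_values(unprocessed_df))  # gaps are visible here
    print(count_missing_values(processed_df))    # gaps were imputed away
```

After imputation the count on the processed data is zero for every feature, so the metric no longer says anything about the quality of the incoming data.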

stavrostheocharis · 2023-01-04T11:47:24Z

Yes, it should definitely be performed on the unprocessed dataset. The open questions are where to do it and how to save the results in the database.

More specifically, let's assume that the function that handles the calculations computes some metrics on the processed dataset and others on the unprocessed one. Then we have to define a way for these metrics to be distinguished by dataset, because the current implementation saves them all together in the database.

So an adjustment to the schema may also be needed.
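One possible shape for such a record, sketched with a plain dataclass. This is not the existing `ModelIntegrityMetricCreate` schema; `FeatureMetricsRecord`, `build_feature_metrics_record` and the field names are assumptions made up for this example, and the set of metrics is reduced to a few.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict

import pandas as pd


@dataclass
class FeatureMetricsRecord:
    """Hypothetical record that keeps the two metric groups apart."""

    model_id: str
    timestamp: str
    processed_metrics: Dict[str, Dict[str, Any]]
    unprocessed_metrics: Dict[str, Dict[str, Any]]


def build_feature_metrics_record(
    model_id: str, processed_df: pd.DataFrame, unprocessed_df: pd.DataFrame
) -> FeatureMetricsRecord:
    # Descriptive statistics are computed on the processed data.
    processed_metrics = {
        "average": processed_df.mean(numeric_only=True).to_dict(),
        "min": processed_df.min(numeric_only=True).to_dict(),
        "max": processed_df.max(numeric_only=True).to_dict(),
    }
    # The missing values count is computed on the unprocessed data.
    unprocessed_metrics = {
        "missing_count": {
            column: int(count)
            for column, count in unprocessed_df.isna().sum().items()
        }
    }
    return FeatureMetricsRecord(
        model_id=model_id,
        timestamp=datetime.now(timezone.utc).isoformat(),
        processed_metrics=processed_metrics,
        unprocessed_metrics=unprocessed_metrics,
    )
```

Keeping the two groups under separate fields would let the API and the alerting logic tell which dataset a value came from, at the cost of a schema migration.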

NickNtamp · 2023-01-04T12:02:45Z

@stavrostheocharis I suggest we discuss it all together on Monday. What do you think?

stavrostheocharis assigned NickNtamp Jan 2, 2023

stavrostheocharis added the discussion needed and needs analysis labels Jan 2, 2023

momegas unassigned NickNtamp Jan 16, 2023

momegas added this to 🐻‍❄️ Whitebox - Issue tracking Feb 15, 2023

momegas moved this to Backlog in 🐻‍❄️ Whitebox - Issue tracking Feb 15, 2023

momegas added the bug label Feb 15, 2023

momegas modified the milestones: 🐻‍❄️ Image models and data monitoring, 🐻‍❄️ Machine learning and data monitoring Feb 15, 2023

momegas added the good first issue label and removed the discussion needed and needs analysis labels Feb 15, 2023

momegas moved this from Backlog to Planned in 🐻‍❄️ Whitebox - Issue tracking Feb 17, 2023

momegas assigned iamtheef Feb 28, 2023

momegas unassigned iamtheef Mar 8, 2023
