Missing values count should be performed in unprocessed dataset #57
Comments
Let's tackle this if we have time this sprint.
Let's do a quick analysis before the implementation.
@stavrostheocharis and @sinnec, here is the further description you asked for. In src->cron_tasks->monitoring_metrics.py you can find the following function: `async def run_calculate_feature_metrics_pipeline(`
As you can see, we calculate the feature metrics (e.g. missing_value_count, average, min, max) on the processed inference data. I believe that the count of missing values specifically has to be calculated on the unprocessed dataset. What do you think?
Yes, it should definitely be performed on the unprocessed dataset. The open questions are where to do it and how to save the results in the database. More specifically, suppose the function that handles the calculations computes some metrics on the processed dataset and others on the unprocessed one. We then need a way to view these metrics separately for each dataset, because the current implementation saves them all together in the database. An adjustment to the schema may therefore also be needed.
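One way to keep the two groups of metrics apart is to tag each stored value with the dataset it was computed on. The sketch below is only an illustration of that idea under stated assumptions: `DatasetStage`, `FeatureMetricRecord` and `build_feature_metrics` are hypothetical names, rows are plain dicts, and nothing here reflects the project's actual schema or the body of `run_calculate_feature_metrics_pipeline`.

```python
import math
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List


class DatasetStage(str, Enum):
    """Hypothetical tag saying which dataset a metric was computed on."""
    UNPROCESSED = "unprocessed"
    PROCESSED = "processed"


@dataclass
class FeatureMetricRecord:
    """Hypothetical row shape: one metric value for one feature and stage."""
    feature: str
    metric: str
    value: float
    stage: DatasetStage


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def build_feature_metrics(
    unprocessed: List[Dict[str, Any]],
    processed: List[Dict[str, Any]],
    feature: str,
) -> List[FeatureMetricRecord]:
    """Count missing values on the raw rows, other stats on the processed rows."""
    records: List[FeatureMetricRecord] = []

    # Missing values are counted before any imputation or row dropping.
    missing = sum(1 for row in unprocessed if is_missing(row.get(feature)))
    records.append(
        FeatureMetricRecord(feature, "missing_count", float(missing), DatasetStage.UNPROCESSED)
    )

    # Descriptive statistics use the processed values that are present.
    values = [row[feature] for row in processed if not is_missing(row.get(feature))]
    if values:
        stats = {
            "average": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        }
        for name, value in stats.items():
            records.append(
                FeatureMetricRecord(feature, name, float(value), DatasetStage.PROCESSED)
            )
    return records
```

With a stage column like this, a query can filter on the stage, so the missing-value count from the raw data never gets mixed with statistics from the processed data.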
@stavrostheocharis I suggest we discuss it all together on Monday. What do you think?
Since the missing value count indicates how many values are missing in a feature of a dataset, it is much more valuable to compute it on the unprocessed dataset than on the processed one: most likely, the missing values will already have been handled in the processed dataset.
Also, for monitoring and alerting purposes, it is more useful to report the missing values as a percentage of the total entries of each feature (e.g. if a feature has 100 entries and 3 missing values, the missing values count is 3%). This makes it much easier to set thresholds for this metric.
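The percentage idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's implementation: `missing_percentage` and `exceeds_threshold` are hypothetical helpers, rows are assumed to be dicts, and `None`, float NaN and absent keys are all assumed to count as missing.

```python
import math
from typing import Any, Dict, List


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_percentage(rows: List[Dict[str, Any]], feature: str) -> float:
    """Share of rows (0 to 100) where the feature is missing in the raw data."""
    if not rows:
        return 0.0  # assumption: an empty dataset reports 0% rather than failing
    # row.get() returns None for an absent key, so absent keys count as missing.
    missing = sum(1 for row in rows if is_missing(row.get(feature)))
    return 100.0 * missing / len(rows)


def exceeds_threshold(rows: List[Dict[str, Any]], feature: str, threshold_pct: float) -> bool:
    """True when the missing percentage is strictly above the alert threshold."""
    return missing_percentage(rows, feature) > threshold_pct
```

Because the result is relative to the number of entries, the same threshold (say 5%) can be reused for datasets of very different sizes, which an absolute count would not allow.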