Remove pickle from data formats, memmap for Tabular predictions #45

geoalgo · 2023-10-17T16:24:35Z

WIP: the evaluation works from scratch and takes much less memory, but I am still checking a few things.

A few notes:

- The model predictions folder for each task now contains memmap predictions (.dat), labels (.csv.zip), and metadata.json.
- Those files are downloaded by the context, as before.
- There is now a file, task_metrics.csv, that contains all metrics for all tasks.
- Tabular predictions can still be instantiated in memory from the same dictionary as before.
- I have not yet ported all the dense utils tests, as some rely on gaps between validation and test, which are no longer possible.
- I have deleted all the pickle-format files for predictions, ground truth, and metadata, as we should avoid them.
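As a rough illustration of the layout described in the notes above, here is a minimal sketch of writing predictions to a raw .dat memmap next to a metadata.json, and reading them back. The function names, the file name predictions.dat, the metadata keys, and the (models, rows, classes) axis order are assumptions made for this example, not the PR's actual implementation.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path

import numpy as np


def save_predictions(task_dir: Path, pred_proba: np.ndarray, models: list[str]) -> None:
    """Write predictions as a raw memmap file plus the metadata needed to reopen it."""
    task_dir.mkdir(parents=True, exist_ok=True)
    # A raw memmap file stores neither shape nor dtype, so both go into metadata.json.
    metadata = {
        "models": models,
        "shape": list(pred_proba.shape),
        "dtype": str(pred_proba.dtype),
    }
    (task_dir / "metadata.json").write_text(json.dumps(metadata))
    out = np.memmap(
        task_dir / "predictions.dat",  # hypothetical file name
        dtype=pred_proba.dtype,
        mode="w+",
        shape=pred_proba.shape,
    )
    out[:] = pred_proba
    out.flush()


def load_predictions(task_dir: Path) -> tuple[np.memmap, list[str]]:
    """Open the predictions read-only; data is paged in from disk only when accessed."""
    metadata = json.loads((task_dir / "metadata.json").read_text())
    preds = np.memmap(
        task_dir / "predictions.dat",
        dtype=metadata["dtype"],
        mode="r",
        shape=tuple(metadata["shape"]),
    )
    return preds, metadata["models"]


# Round trip on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    task_dir = Path(tmp) / "example_task"
    original = np.arange(24, dtype=np.float32).reshape(2, 4, 3)  # assumed (models, rows, classes)
    save_predictions(task_dir, original, ["model_a", "model_b"])
    loaded, loaded_models = load_predictions(task_dir)
    round_trip_ok = bool(np.array_equal(loaded, original))
    del loaded  # release the mapping before the directory is removed
```

Unlike pickle, this format is plain data: loading it executes no code, and opening the file does not pull the whole array into memory.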

data/metadata/task_metric_names.csv

Innixma · 2023-10-17T21:11:12Z

tabrepo/simulation/ensemble_selection_config_scorer.py

 class EnsembleScorer:
    def __init__(self,
                 zeroshot_pp: TabularModelPredictions | dict,
-                 zeroshot_gt: Dict[str, Dict[int, Dict[str, Any]]],
+                 tid_to_dataset_dict: dict,


tid_to_dataset_dict should not be passed here. Instead, dataset should always be passed to evaluate_task as was done before. self.zeroshot_gt and self.zeroshot_pp should not use different keys to fetch the required information.

Previously EnsembleScorer had no concept of tid, which was the main point of the class, to remove the need to worry about tid and dataset conversions.

I can handle most of this, but please align self.zeroshot_gt and self.zeroshot_pp to use the same key (dataset is probably best).
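One way to read this request, as a minimal sketch: predictions and ground truth are both keyed by dataset, and any tid-to-dataset conversion happens once, before the data reaches the scorer. DatasetKeyedScorer and key_by_dataset are hypothetical names, and accuracy is only a placeholder metric; this is not the real EnsembleScorer.

```python
from __future__ import annotations

import numpy as np


class DatasetKeyedScorer:
    """Hypothetical stand-in for a scorer that has no concept of tid."""

    def __init__(
        self,
        predictions: dict[str, dict[int, np.ndarray]],
        ground_truth: dict[str, dict[int, np.ndarray]],
    ) -> None:
        # Fail early if the two sources do not share the same dataset keys.
        missing = set(predictions) - set(ground_truth)
        if missing:
            raise KeyError(f"no ground truth for datasets: {sorted(missing)}")
        self.predictions = predictions
        self.ground_truth = ground_truth

    def evaluate_task(self, dataset: str, fold: int) -> float:
        # Both lookups use the same (dataset, fold) key.
        y_pred = self.predictions[dataset][fold]
        y_true = self.ground_truth[dataset][fold]
        return float(np.mean(y_pred == y_true))  # placeholder metric: accuracy


def key_by_dataset(by_tid: dict[int, dict], tid_to_dataset: dict[int, str]) -> dict[str, dict]:
    """Re-key a tid-indexed mapping by dataset name, once, outside the scorer."""
    return {tid_to_dataset[tid]: value for tid, value in by_tid.items()}


tid_to_dataset = {3: "example_dataset"}
ground_truth = key_by_dataset({3: {0: np.array([0, 1, 1, 0])}}, tid_to_dataset)
predictions = {"example_dataset": {0: np.array([0, 1, 0, 0])}}
scorer = DatasetKeyedScorer(predictions, ground_truth)
score = scorer.evaluate_task("example_dataset", fold=0)
```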

> Instead, dataset should always be passed to evaluate_task as was done before. self.zeroshot_gt and self.zeroshot_pp should not use different keys to fetch the required information.

I did not change how evaluate_task is called; I just renamed dataset to tid, since the value being passed was effectively a tid.

Previously zeroshot_pp held the dictionary internally (with a rename_datasets method), so the mapping is not new; it has just been moved out of zeroshot_pp.

I also think it would be good to use tid: int or dataset: str consistently when getting ground truth, predictions, etc., but that is quite involved and probably out of scope for this PR.

> Previously EnsembleScorer had no concept of tid, which was the main point of the class, to remove the need to worry about tid and dataset conversions.

Reading this one, it seems you mean that the name dataset should be kept. That can be done, but we would have to reintroduce rename_dict or update the ground-truth class. I will check whether it is possible without too much change.

I took a look, and it is not straightforward to unify dataset/tid in zeroshot_gt and zeroshot_pp. Of course, just unifying those two classes is trivial; the difficulty is integrating all the other classes with the new naming. Right now, data is accessed in many different ways in different parts of the code and classes: for instance, tasks are addressed by a string f"{tid}_{fold}", by a float 123.4 (where 123 is the task and 4 is the fold), by a tuple, by dataset, or by tid.

This PR already covers a lot by revamping all the pickle data structures and associated classes and updating all context files, so I think the unification should be left to another PR (which could, for instance, always address data by a (tid, fold) tuple; that would make the change much easier, but it would have to touch all the classes).
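To make the proposed tuple addressing concrete, here is a minimal sketch of a helper that normalizes the string, float, and tuple forms mentioned above into a single (tid, fold) tuple. The name to_task_key is hypothetical, and the float branch assumes a non-negative tid and a fold encoded as a single decimal digit, as in the 123.4 example.

```python
from __future__ import annotations

from typing import Tuple, Union

TaskKey = Tuple[int, int]  # (tid, fold)


def to_task_key(task: Union[str, float, Tuple[int, int]]) -> TaskKey:
    """Normalize the different task encodings into a (tid, fold) tuple."""
    if isinstance(task, tuple):
        tid, fold = task
        return int(tid), int(fold)
    if isinstance(task, str):
        # String form "{tid}_{fold}", e.g. "123_4".
        tid_str, fold_str = task.rsplit("_", 1)
        return int(tid_str), int(fold_str)
    if isinstance(task, float):
        # Float form, e.g. 123.4: integer part is the tid, first decimal is the fold.
        # Rounding absorbs floating-point error; assumes 0 <= fold <= 9.
        tid = int(task)
        fold = round((task - tid) * 10)
        return tid, fold
    raise TypeError(f"unsupported task identifier: {task!r}")
```

With a helper like this at the boundaries, internal code could rely on one key type, e.g. to_task_key("123_4") and to_task_key(123.4) both give (123, 4).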

Sure, I can try to see if it can be unified in a separate PR.

Innixma · 2023-10-17T21:25:06Z

tabrepo/simulation/tabular_predictions.py

-        if rename_dict_inv:
-            dataset = rename_dict_inv.get(dataset, dataset)
-        return str(Path(output_dir) / f'{dataset}' / f'{fold}.pkl')
+class TabularPredictionsMemmap(TabularModelPredictions):


Please include a conversion method from TabularPredictionsMemmap to TabularPredictionsInMemory. It is easy to convert TabularPredictionsInMemory to TabularPredictionsMemmap, but the same is not currently true for the reverse.

Sure, I have added one.
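For illustration only, a conversion of this kind could look like the sketch below: each memmap-backed array is copied into an ordinary in-memory NumPy array and arranged as a nested dictionary. The function name memmap_to_in_memory, the dataset -> fold -> model nesting, and the convention that axis 0 indexes models are assumptions for the example; the method actually added in the PR may differ.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

import numpy as np


def memmap_to_in_memory(
    memmaps: dict[str, dict[int, np.ndarray]],
    models: list[str],
) -> dict[str, dict[int, dict[str, np.ndarray]]]:
    """Copy memmap-backed predictions into plain arrays: dataset -> fold -> model."""
    return {
        dataset: {
            # np.array(...) copies the slice into memory and returns a base ndarray,
            # so the result no longer depends on the file on disk.
            fold: {model: np.array(mm[i]) for i, model in enumerate(models)}
            for fold, mm in folds.items()
        }
        for dataset, folds in memmaps.items()
    }


with tempfile.TemporaryDirectory() as tmp:
    mm = np.memmap(Path(tmp) / "example.dat", dtype="float32", mode="w+", shape=(2, 3))
    mm[:] = [[1, 2, 3], [4, 5, 6]]  # axis 0: models (assumed), axis 1: predictions
    in_memory = memmap_to_in_memory({"example_dataset": {0: mm}}, ["model_a", "model_b"])
    del mm  # the copies stay valid after the mapping and the file are gone
```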

Innixma

Added a few fixes, LGTM!

Remove pickle from data formats, memmap for Tabular predictions

f953749

geoalgo requested a review from Innixma October 17, 2023 16:24

fix ci issue

86a205f

Innixma reviewed Oct 17, 2023

data/metadata/task_metric_names.csv (resolved)

Innixma reviewed Oct 17, 2023

Innixma and others added 4 commits October 17, 2023 21:27

fix

3704efc

add to_dict in tabular predictions

1cf806e

Fix

9b899c3

Fix incorrect to_dict

c455f62

Innixma approved these changes Oct 19, 2023

minor fix

b4f12ae

Innixma merged commit 17a71f7 into main Oct 19, 2023
1 check passed
