How to provide validation set partitions #139
Comments
Hey @fteufel, thanks again! 😄 Great question! We deliberately don't include the specification of a validation set. To us, the choice of validation set is a design decision during method development that we don't want to constrain. Take hyper-parameter search as an example: some may argue that a random split for the validation set is best because it results in a more diverse training set, whereas others may argue that some sort of OOD split is best because it selects hyper-parameters that are better suited for generalization. From Polaris' perspective, both are valid design decisions one can make during method development, and it should be up to the user to decide. What do you think of this reasoning? Do you agree? Since this turns into more of a philosophical discussion than a feature request, and because I expect this question to come up more often, I suggest we move this conversation to a GitHub Discussion.

Having said all that, I do think we can do a better job of making it easy to split off a validation set as part of the Polaris API. I'm very much open to suggestions on how you think that could or should work! Maybe something like:

```python
benchmark = po.load_benchmark(...)
train, test = benchmark.get_train_test_split(...)
train, val = train.split(indices=([0, 1, 2], [3]))
```
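As an aside, the "random vs. OOD" point above could look like the sketch below with the proposed (and currently hypothetical) `split` method. The benchmark name is a placeholder, `train.indices` is borrowed from the example further down this thread, and the `split` call is the suggestion from this comment, not an existing Polaris API.

```python
import numpy as np
import polaris as po

benchmark = po.load_benchmark("org/some-benchmark")  # placeholder benchmark name
train, test = benchmark.get_train_test_split()

# Option A: random 90/10 validation split (keeps the training set diverse).
rng = np.random.default_rng(42)
perm = rng.permutation(len(train.indices))
cutoff = int(0.9 * len(perm))
train_sub, val = train.split(indices=(perm[:cutoff], perm[cutoff:]))  # hypothetical method

# Option B: any OOD split the user prefers, expressed as the same kind of index pair.
# train_sub, val = train.split(indices=(ood_train_idx, ood_val_idx))
```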
I definitely agree that this can be left as a design decision - but if we already have some recommended OOD scheme defined on the data, it would be helpful to have a way to provide it to the user in case they want to stick with it. Right now I don't see a good way to provide it: if I understand correctly, when you access the data via the benchmark only, you are only ever exposed to the input and target columns.
To make sure I understand: what type of information are you thinking of here?
For example, some structural/sequence-identity clustering of the samples that was used for splitting, i.e. something that would take quite some extra effort to reproduce ad hoc on the train data.
Thank you for sharing some additional information. Given the reasons outlined above, I still don't think we want to support specifying a fixed validation split. I do think, however, we should work on features that make it easier to split the train set into a train and a validation portion (like adding a `split()` method to the returned train subset, as sketched above).
You could add a column to your dataset that contains that information if it's important! Building on the above API example:

```python
import numpy as np
import polaris as po

benchmark = po.load_benchmark(...)
train, test = benchmark.get_train_test_split(...)

# Access the underlying dataset
df = benchmark.dataset.table
train_df = df.iloc[train.indices]

# Cluster based on a column in that dataset
clustering = train_df["clustering"]
train_ind = np.where(clustering == "train")[0]
val_ind = np.where(clustering == "val")[0]

# Get the train and validation split
train, val = train.split(indices=(train_ind, val_ind))
```
Sure, I can do that. The reason I asked in the first place was that the whole API gave me the impression that you wanted to discourage users from interacting with the actual dataframe directly, since it's buried two levels down in the benchmark object. If that's not the case, I guess I will just put it there and mention it in the README.
You're not wrong... This definitely would be a use case for advanced Polaris users, but that may be okay for a use case we don't want to officially support (at least not yet). An alternative could be to specify it as an additional input column, but that feels a bit hacky. Maybe we could introduce "metadata columns" (in addition to target and input columns)? These would hold data you may want to use during training, but don't have to use? Not sure, feels a little convoluted and confusing as well! What do you think?
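For concreteness, here is a rough sketch of what such "metadata columns" could look like on the consumer side. The `metadata` accessor does not exist in Polaris today; it is purely an illustration of the proposal in the comment above, and the benchmark name is a placeholder.

```python
import numpy as np
import polaris as po

benchmark = po.load_benchmark("org/some-benchmark")  # placeholder benchmark name
train, test = benchmark.get_train_test_split()

# Hypothetical accessor for a designated metadata column (not a real Polaris API):
clustering = train.metadata["clustering"]

# The precomputed clustering could then drive a validation split, as in the
# example above, without reaching into benchmark.dataset.table directly.
train_ind = np.where(clustering == "train")[0]
val_ind = np.where(clustering == "val")[0]
```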
Final thought: as you suggested, specifying a second "test" set and naming it "validation" could work too! I do think that would make it mandatory for any submission to also include results on the validation set.
Considering everything, providing it as a second test set does sound the cleanest :) Is there anything we need to consider to prevent people from then pushing results from the "wrong" test set back to the hub?
You'll see results for both test sets on the leaderboard. You can select which results to view with the dropdown above the leaderboard.
Ah yes. I see in the source that it will be mandatory to also evaluate the validation set then. Not really useful, but I guess we can live with that.
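To make the "second test set" workaround concrete, below is a minimal sketch of how it could look on the consumer side. It assumes that `get_train_test_split` returns a dict of named test subsets when a benchmark defines more than one, and that `evaluate` accepts a dict of predictions keyed by test-set name; those assumptions are based on the discussion above, not confirmed API, and `predict_fn` is a stand-in for a real model.

```python
import polaris as po

def predict_fn(subset):
    # Stand-in for a real model: one dummy prediction per sample in the subset.
    return [0.0] * len(subset.indices)

benchmark = po.load_benchmark("org/some-benchmark")  # placeholder benchmark name
train, tests = benchmark.get_train_test_split()

# Assumption: with multiple test sets, the second return value is a dict keyed by name.
val = tests["validation"]
test = tests["test"]

# Train on `train`, tune on `val`, then predict on both named sets, since
# (per the discussion above) both must be evaluated before uploading results.
y_pred = {
    "validation": predict_fn(val),
    "test": predict_fn(test),
}
results = benchmark.evaluate(y_pred)
```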
Context
Datasets for deep learning typically feature an additional validation/dev partition, besides train and test, that is used to monitor model training. Right now, Polaris only accepts train and test for `split` when defining a benchmark.

Description
It will be necessary to also provide the validation partition to the user. How should this be done? By extending the API, or by 'abusing' the optional dictionary structure for multiple test sets?
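For reference, a minimal sketch of the "dictionary structure" option mentioned above, assuming the second element of a benchmark's `split` may be a dict of named test-set index lists. The index values are dummies, and the surrounding benchmark-creation call is omitted since the exact class and arguments are not shown in this thread.

```python
# Index lists derived from the recommended clustering (dummy values for illustration).
train_indices = [0, 1, 2, 3, 4, 5]
val_indices = [6, 7]   # recommended OOD validation partition
test_indices = [8, 9]

# Assumption: the second element of `split` can be a dict of named test sets,
# so the validation partition is exposed as a second, named "test" set.
split = (train_indices, {"test": test_indices, "validation": val_indices})

# This `split` would then be passed when defining the benchmark via the relevant
# BenchmarkSpecification class (name and other arguments not shown in this thread).
```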