How to provide validation set partitions #139
Comments
Hey @fteufel, thanks again! 😄 Great question! We deliberately don't include the specification of a validation set. To us, the choice of validation set is a design decision during method development that we don't want to constrain. Take hyper-parameter search as an example: some may argue that a random split for the validation set is best because it results in a more diverse training set, whereas others may argue that some sort of OOD split is best because it selects hyper-parameters that are better suited for generalization. From Polaris' perspective, both are valid design decisions one can make during method development, and it should be up to the user to decide. What do you think of this reasoning? Do you agree? Since this turns into more of a philosophical discussion than a feature request, and because I expect this question to come up more often, I suggest we move this conversation to a GitHub Discussion.

Having said all that, I do think we can do a better job of making it easy to split off a validation set as part of the Polaris API. I'm very much open to suggestions on how you think that could or should work! Maybe something like:

```python
benchmark = po.load_benchmark(...)
train, test = benchmark.get_train_test_split(...)
train, val = train.split(indices=([0, 1, 2], [3]))
```
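As an aside, the "random vs. OOD" point above could look like the sketch below with the proposed (and currently hypothetical) `split` method. The benchmark name is a placeholder, `train.indices` is borrowed from the example further down this thread, and the `split` call is the suggestion from this comment, not an existing Polaris API.

```python
import numpy as np
import polaris as po

benchmark = po.load_benchmark("org/some-benchmark")  # placeholder benchmark name
train, test = benchmark.get_train_test_split()

# Option A: random 90/10 validation split (keeps the training set diverse).
rng = np.random.default_rng(42)
perm = rng.permutation(len(train.indices))
cutoff = int(0.9 * len(perm))
train_sub, val = train.split(indices=(perm[:cutoff], perm[cutoff:]))  # hypothetical method

# Option B: any OOD split the user prefers, expressed as the same kind of index pair.
# train_sub, val = train.split(indices=(ood_train_idx, ood_val_idx))
```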
I definitely agree that this can be left as a design decision - but if we already have some recommended OOD scheme defined on the data, it would be helpful to have a way to provide it to the user in case they want to stick with it. Right now I don't see a good way to provide it: if I understand correctly, when you access the data via the benchmark only, you are only ever exposed to the input and target columns.
To make sure I understand: what type of information are you thinking of here?
For example, some structural/sequence-identity clustering of the samples that was used for splitting, i.e. something that would take quite some extra effort to reproduce ad hoc on the train data.
Thank you for sharing some additional information. Given the reasons outlined above, I still don't think we want to support specifying a fixed validation split. I do think, however, we should work on features that make it easier to split the train set into a train and a validation portion (like adding a `split()` method to the returned train subset, as sketched above).
You could add a column to your dataset that contains that information if it's important! Building on the above API example:

```python
import numpy as np
import polaris as po

benchmark = po.load_benchmark(...)
train, test = benchmark.get_train_test_split(...)

# Access the underlying dataset
df = benchmark.dataset.table
train_df = df.iloc[train.indices]

# Cluster based on a column in that dataset
clustering = train_df["clustering"]
train_ind = np.where(clustering == "train")[0]
val_ind = np.where(clustering == "val")[0]

# Get the train and validation split
train, val = train.split(indices=(train_ind, val_ind))
```
Sure, I can do that. The reason I asked in the first place was that the whole API gave me the impression that you wanted to discourage users from interacting with the actual dataframe directly, since it's buried two levels down in the benchmark object. If that's not the case, I guess I will just put it there and mention it in the README.
You're not wrong... This definitely would be a use case for advanced Polaris users, but that may be okay for a use case we don't want to officially support (at least not yet). An alternative could be to specify it as an additional input column, but that feels a bit hacky. Maybe we could introduce "metadata columns" (in addition to target and input columns)? These would hold data you may want to use during training, but don't have to use? Not sure, feels a little convoluted and confusing as well! What do you think?
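For concreteness, here is a rough sketch of what such "metadata columns" could look like on the consumer side. The `metadata` accessor does not exist in Polaris today; it is purely an illustration of the proposal in the comment above, and the benchmark name is a placeholder.

```python
import numpy as np
import polaris as po

benchmark = po.load_benchmark("org/some-benchmark")  # placeholder benchmark name
train, test = benchmark.get_train_test_split()

# Hypothetical accessor for a designated metadata column (not a real Polaris API):
clustering = train.metadata["clustering"]

# The precomputed clustering could then drive a validation split, as in the
# example above, without reaching into benchmark.dataset.table directly.
train_ind = np.where(clustering == "train")[0]
val_ind = np.where(clustering == "val")[0]
```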
Final thought: as you suggested, specifying a second "test" set and naming it "validation" could work too! I do think that would make it mandatory for any submission to also include results on the validation set.
Considering everything, providing it as a second test set does sound the cleanest :) Is there anything we need to consider to prevent people from then pushing results from the "wrong" test set back to the hub?
You'll see results for both test sets on the leaderboard. You can select which results to view with the dropdown above the leaderboard.
Ah yes. I see in the source that it will be mandatory to also evaluate the validation set then. Not really useful, but I guess we can live with that.
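To make the "second test set" workaround concrete, below is a minimal sketch of how it could look on the consumer side. It assumes that `get_train_test_split` returns a dict of named test subsets when a benchmark defines more than one, and that `evaluate` accepts a dict of predictions keyed by test-set name; those assumptions are based on the discussion above, not confirmed API, and `predict_fn` is a stand-in for a real model.

```python
import polaris as po

def predict_fn(subset):
    # Stand-in for a real model: one dummy prediction per sample in the subset.
    return [0.0] * len(subset.indices)

benchmark = po.load_benchmark("org/some-benchmark")  # placeholder benchmark name
train, tests = benchmark.get_train_test_split()

# Assumption: with multiple test sets, the second return value is a dict keyed by name.
val = tests["validation"]
test = tests["test"]

# Train on `train`, tune on `val`, then predict on both named sets, since
# (per the discussion above) both must be evaluated before uploading results.
y_pred = {
    "validation": predict_fn(val),
    "test": predict_fn(test),
}
results = benchmark.evaluate(y_pred)
```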
Context
Datasets for deep learning typically feature an additional validation/dev partition, besides train and test, that is used to monitor model training. Right now, Polaris only accepts train and test for `split` when defining a benchmark.

Description
It will be necessary to also provide the validation partition to the user. How should this be done? By extending the API, or by 'abusing' the optional dictionary structure for multiple test sets?
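For reference, a minimal sketch of the "dictionary structure" option mentioned above, assuming the second element of a benchmark's `split` may be a dict of named test-set index lists. The index values are dummies, and the surrounding benchmark-creation call is omitted since the exact class and arguments are not shown in this thread.

```python
# Index lists derived from the recommended clustering (dummy values for illustration).
train_indices = [0, 1, 2, 3, 4, 5]
val_indices = [6, 7]   # recommended OOD validation partition
test_indices = [8, 9]

# Assumption: the second element of `split` can be a dict of named test sets,
# so the validation partition is exposed as a second, named "test" set.
split = (train_indices, {"test": test_indices, "validation": val_indices})

# This `split` would then be passed when defining the benchmark via the relevant
# BenchmarkSpecification class (name and other arguments not shown in this thread).
```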