Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

k-fold CV #1574

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

k-fold CV #1574

wants to merge 5 commits into from

Conversation

mmosc
Copy link

@mmosc mmosc commented Dec 6, 2022

Hi!

For my own research, I implemented k-fold CV in your library. Have a look and let me know if it looks good for you! Maybe someone else will also need it.

Cheers,
Marta

@AoiDragon AoiDragon self-assigned this Dec 12, 2022
@AoiDragon
Copy link
Collaborator

@mmosc Hello, thanks for your attention to RecBole.

I tested your code and found some questions. K-fold CV means seperate the dataset into K groups and chooses one group as the test set in turn. However, I found that your code only divides the dataset and fixes the test set. So actually the code works like a RS split which have a different split ratio.

@mmosc
Copy link
Author

mmosc commented Dec 15, 2022

Hi @AoiDragon !
Thanks for having a look at this.

Actually when calling data_preparation in utils.py, the data gets split in folds, and depending on the value of k = config["eval_args"]["fold"] in the overall.yaml, those get rolled in a round robin fashion. After rolling, valid_dataset becomes the second to last fold, and test_dataset becomes the last fold.

So overall, depending on config["eval_args"]["fold"], you are choosing different folds for val and test.

Do you agree on this?

@AoiDragon
Copy link
Collaborator

Hello @mmosc ,

In my understanding, your method need to call data_preparation mutiple times. However, in default quick_start the data_preparation will only be called once, so under default configuration it just divides the dataset once and fixes train, valid and test set. This may mislead users and I didn't see code about this part, so I'm afraid this version of your commit is unsuitable for merging.

@mmosc
Copy link
Author

mmosc commented Dec 18, 2022

Hi @AoiDragon ,
When calling data_preparation with KF as parameter, the following code gets executed:

        if list(config["eval_args"]["split"].keys())[0] == 'KF':
            print("==KF")
            folds = dataset.build()# data_preparation(config, dataset)
            n_folds = len(folds)
            print(n_folds)
            k = config["eval_args"]["fold"]
            folds = folds[k:] + folds[:k]

The k folds are generated when calling dataset.build(), in which the following happens:

        elif split_mode == "KF":
            """
            Will return n_folds datasets
            """
            if not isinstance(split_args["KF"], list):
                raise ValueError(f'The value of "KF" [{split_args}] should be a list.')
            if group_by is None or group_by.lower() == "none":
                datasets = self.split_by_ratio(split_args["KF"], group_by=None)
            elif group_by == "user":
                datasets = self.split_by_ratio(
                    split_args["KF"], group_by=self.uid_field
                )
            else:
                raise NotImplementedError(
                    f"The grouping method [{group_by}] has not been implemented."
                )

This is where split_by_ratio is called (only once) with the percentages of the k-folds as argument. This is the part that will return as many folds as passed by the list with percentages. Calling it only once suffices.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants