Add data checkpoint within epoch feature #17

floatingbigcat · 2023-06-09T05:12:36Z

Description

When our dataset is very large, we want the model to walk through it without replacement, possibly for only one or a few epochs.
We can't finish training in one shot because of the wall-time limit on each job, so the dataloader needs to be able to resume from a given iteration within an epoch.

Solution
open_clip offers a solution: slice the shards into many subsets, and have each "sub_epoch" walk through one subset. We record the sub_epoch number and use it when training starts to restore the data position.
mlfoundations/open_clip#535
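
A minimal sketch of the sub_epoch idea described above, under stated assumptions: it is not the open_clip implementation, and the helper name `shards_for_sub_epoch`, its parameters, and the shard file names are hypothetical. It assumes the dataset is a flat list of shard paths, reshuffles that list once per full epoch with a seeded RNG, and gives each sub_epoch a disjoint strided slice. With this layout, the only data state a job needs to save and restore is the integer sub_epoch counter.

```python
import random
from typing import List, Sequence


def shards_for_sub_epoch(
    shards: Sequence[str],
    num_sub_epochs: int,
    sub_epoch: int,
    seed: int = 0,
) -> List[str]:
    """Return the shard subset for a global sub_epoch index (hypothetical helper)."""
    if num_sub_epochs <= 0:
        raise ValueError("num_sub_epochs must be positive")
    if sub_epoch < 0:
        raise ValueError("sub_epoch must be non-negative")

    # Which full pass over the data we are in, and which slice of that pass.
    full_epoch, index = divmod(sub_epoch, num_sub_epochs)

    # Deterministic shuffle per full epoch, so a restarted job rebuilds
    # exactly the same shard order from the saved counter alone.
    order = list(shards)
    random.Random(seed + full_epoch).shuffle(order)

    # Strided slices are disjoint and together cover every shard once.
    return order[index::num_sub_epochs]


if __name__ == "__main__":
    # Assumed shard naming, for illustration only.
    all_shards = [f"shard-{i:04d}.tar" for i in range(10)]

    # A resumed job would load this counter from its training checkpoint.
    resume_sub_epoch = 2
    for sub_epoch in range(resume_sub_epoch, 4):
        subset = shards_for_sub_epoch(all_shards, num_sub_epochs=4, sub_epoch=sub_epoch)
        print(sub_epoch, subset)
```

Saving the sub_epoch counter alongside the model and optimizer state lets a restarted job skip the subsets it has already consumed. Resumption is at subset granularity, so a job interrupted partway through a subset would repeat that subset from its start.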

floatingbigcat · 2023-09-24T07:39:18Z

#29

floatingbigcat added the enhancement New feature or request label Jun 9, 2023

floatingbigcat self-assigned this Jun 9, 2023

kshitijkg added this to the Robin V1 milestone Jun 16, 2023

kshitijkg removed this from the Robin V1 milestone Jul 6, 2023

kshitijkg added this to the Robin V1 milestone Aug 7, 2023

floatingbigcat closed this as completed Sep 24, 2023
