** Environment **
composer version 0.26.0
torch 2.4.0

** To reproduce **
Steps to reproduce the behavior:
$ git clone https://github.com/maxrousseau/rafale.git
$ cd rafale
$ uv venv
$ . .venv/bin/activate
$ uv pip install -r cuda-requirements.txt
$ uv pip install -e .
$ rafale-run test/pythia_tinystories.yaml
$ # cancel the current run
$ rafale-run test/pythia_tinystories.yaml # resumes from the "latest" checkpoint
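For reference, the resumption relies on Composer's autoresume checkpointing. A minimal, self-contained sketch of those mechanics (toy model and data for illustration only; this is not rafale's actual code) looks roughly like this:

import torch
from torch.utils.data import DataLoader, TensorDataset
from composer import Trainer
from composer.models import ComposerClassifier

# Toy data and model so the sketch runs anywhere; rafale wraps Pythia instead.
dataset = TensorDataset(torch.randn(1024, 8), torch.randint(0, 2, (1024,)))
dataloader = DataLoader(dataset, batch_size=64, shuffle=False)
model = ComposerClassifier(torch.nn.Linear(8, 2), num_classes=2)

trainer = Trainer(
    model=model,
    train_dataloader=dataloader,
    max_duration="1ep",
    run_name="resume-demo",                 # autoresume keys off a stable run name
    save_folder="checkpoints/resume-demo",
    save_interval="5ba",                    # the reported config uses "50ba"
    autoresume=True,                        # re-running picks up the "latest" checkpoint
    # device_train_microbatch_size="auto",  # the report's setting; "auto" needs a GPU
)
trainer.fit()  # interrupt and re-run the script: training continues from the last checkpoint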
Expected behavior
A near-exact continuation of the training loss curve compared to the uninterrupted run. Instead, after the second or third resumption the loss begins to diverge (see plot below). I suspect that gradient accumulation may be the issue: either the accumulated gradients are not stored in the checkpoint, or we are restarting mid-batch and the accumulated gradients are lost.
Note: purple is the uninterrupted run which has lower training loss.
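To make the mid-batch concern concrete, here is a plain-PyTorch illustration (not Composer's internals): accumulated gradients live in param.grad and are not captured by the usual model/optimizer state dicts, so a restart between micro-batches would silently drop them.

import torch

model = torch.nn.Linear(8, 2)
opt = torch.optim.AdamW(model.parameters(), lr=6e-4)

for micro_step in range(4):              # 4 micro-batches accumulated per optimizer step
    x = torch.randn(16, 8)
    loss = model(x).pow(2).mean() / 4    # scale the loss for accumulation
    loss.backward()                      # gradients accumulate in param.grad

# A typical checkpoint: weights + AdamW moments, but nothing from param.grad.
ckpt = {"model": model.state_dict(), "optimizer": opt.state_dict()}

opt.step()       # only now do the accumulated gradients affect the weights
opt.zero_grad()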
Additional context
I am using device_microbatch_size="auto" for my training run. The full run configuration is the following:
run:
  name: "pythia14m-tinystories" # name of your experiment, used for checkpointing
  seed: 42
  n_epochs: 1
  max_lr: 6e-04
  warmup_pct: 0.01
  schedule: "cosine-warmup" # linear, linear-warmup, cosine, cosine-warmup
  optimizer: "AdamW"
  eval_interval: "100ba"
  clip_type: "norm"
  clip_value: 1.0
  device_bs: "auto"
  save_interval: "50ba"
  train_key: "train"
  eval_key: "validation"

model:
  config: "pythia14m" # config key
  type: "decoder"
  use_pretrained: True
  # mode: None
  # n_classes: None

data:
  pipeline: "tinystories_neox" # the preprocessing/tokenization pipeline
  config:
    name: "tinystories"
    num_processes: 8
    tokenizer_name: "neox"
    shuffle_dataset: True # this will shuffle the whole training dataset once
    input_id_key: "input_ids"
    train_batch_size: 1024
    eval_batch_size: 16
    shuffle_train: False
    dataset_path: "~/code/data/TinyStories"
    tokenizer_path: "EleutherAI/pythia-14m"
    max_sequence_length: 512
    pad_token_id: -100
    pad_inputs: True
    is_prepared: False
    subset_key_mappings: { "train": "train", "validation": "validation" } # (source: target)
Hi, thanks for replying! No, I have not tried llm-foundry; I am developing my own infrastructure for small-scale experiments on top of Composer to have as much flexibility as possible (i.e., not limited to training LLMs).
Sorry if I was unclear about the mid-batch thing. It is not clear to me what a "step" corresponds to: a) a full dataloader batch (in my case 1024 examples), or b) a device microbatch (here 32 or 64 samples on my RTX 3090). If it is the latter, that could be the source of the problem.
IIUC, gradient accumulation is not the issue: in Composer's event loop we only checkpoint after gradient accumulation is done and the optimizer has taken its step. Put another way, a step should be considered a full dataloader batch.
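For concreteness, with the configuration above (the microbatch value below is only an assumed example of what "auto" might pick):

train_batch_size = 1024   # one "ba" (step) = one full dataloader batch
device_microbatch = 64    # assumed value "auto" might choose on an RTX 3090
save_interval_ba = 50

grad_accum_passes = train_batch_size // device_microbatch      # 16 micro-batches per step
examples_per_checkpoint = train_batch_size * save_interval_ba  # 51,200 examples per "50ba"
print(grad_accum_passes, examples_per_checkpoint)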
To test this hypothesis, you can replace "auto" with a fixed value divisible by your save_interval and check whether you can still reproduce the non-determinism. If not, there is likely a bug on our side and we would need to investigate our recent changes further.
Alternatively, to fix your non-determinism, I would double-check that you are running forward passes over the exact same data points at each checkpoint resumption. We use streaming to ensure that our samples are exactly the same upon resumption.
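One quick way to check that (a sketch, assuming the collator returns a dict keyed by "input_ids" as in the config above; this is not a Composer or rafale API) is to log a fingerprint of the first few batches right after each (re)start and diff it against the uninterrupted run:

import hashlib
import torch

def batch_fingerprint(batch: dict, key: str = "input_ids") -> str:
    """Stable short hash of a batch's token ids."""
    ids: torch.Tensor = batch[key]
    return hashlib.sha256(ids.cpu().numpy().tobytes()).hexdigest()[:16]

def log_first_batches(dataloader, n: int = 3) -> None:
    """Print fingerprints of the first n batches seen after (re)starting."""
    for i, batch in enumerate(dataloader):
        if i >= n:
            break
        print(f"batch {i}: {batch_fingerprint(batch)}")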