Improved Documentation, Rust Tokenize Shuffle #96
base: main
Conversation
### Training
data-config comes from the json created (manually for rust code, automatically for ray) after tokenize shuffle.
Is it possible to incorporate the manifest creation in the above?
We would have to write a separate Python script and have a bash script that runs the two together; a rough sketch of such a wrapper is below.
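A minimal sketch of what that wrapper could look like, assuming a hypothetical make_manifest.py helper and reusing the tokenize shuffle flags shown in this PR. The cargo entry point and the --input/--output flag names are assumptions here; the rust tokenize shuffle README is the authoritative reference for the actual invocation.

```bash
#!/bin/bash
# Hypothetical wrapper: run the rust tokenize shuffle, then build the
# data-config JSON that training.train expects. Nothing below is part of
# the repo yet; paths, flag names, and make_manifest.py are placeholders.
set -euo pipefail

INPUT_DIR=$1    # raw jsonl shards
OUTPUT_DIR=$2   # destination for tokenized/shuffled shards

# Step 1: tokenize + shuffle (flags mirror the example in this PR; the
# cargo entry point is an assumption -- see the rust tokshuf README).
cargo run --release -- \
    --input "$INPUT_DIR" \
    --output "$OUTPUT_DIR" \
    --tokenizer "EleutherAI/gpt-neox-20b" \
    --seqlen 2049 \
    --wds-chunk-size 8192 \
    --num-local-cells 512

# Step 2: write the manifest / data-config JSON (make_manifest.py is a
# hypothetical helper that would have to be written separately).
python make_manifest.py \
    --shard-dir "$OUTPUT_DIR" \
    --output exp_data/datasets/tokenized/my_dataset_tokshuf.json
```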
--tokenizer "EleutherAI/gpt-neox-20b" \
--seqlen 2049 \
--wds-chunk-size 8192 \
--num-local-cells 512
Should --num-local-cells be equal to the number of available cores? (In which case we should probably decrease the default / add a comment.)
No. There is an explanation for this in the rust tokenize shuffle readme, but this default is reasonable.
README.md (Outdated)
### Training
data-config comes from the json created (manually for rust code, automatically for ray) after tokenize shuffle.
```bash
torchrun --nproc-per-node 8 -m training.train --scale 1b_1x_fast --data-config exp_data/datasets/tokenized/dclm_gs3_ls1_rs_tokshuf.json --logs dclm_rs_tokshuf_training_local_logs --torchcompile
```
We should change this to torchrun --nproc-per-node 8 -m training.train -- --scale 1b_1x_fast --data-config exp_data/datasets/tokenized/dclm_gs3_ls1_rs_tokshuf.json --logs dclm_rs_tokshuf_training_local_logs --attn-name torch_attn --torchcompile, because newer versions of torchrun seem to have an issue with the --logs option (adding -- makes it clear that --logs is an option to training.train, not torchrun).
edit: added --attn-name torch_attn as well
Added.
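For reference, the corrected command from the suggestion above; the standalone -- ensures everything after it is parsed by training.train rather than by torchrun itself:

```bash
torchrun --nproc-per-node 8 -m training.train -- \
    --scale 1b_1x_fast \
    --data-config exp_data/datasets/tokenized/dclm_gs3_ls1_rs_tokshuf.json \
    --logs dclm_rs_tokshuf_training_local_logs \
    --attn-name torch_attn \
    --torchcompile
```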
- Added rust tokenize shuffle + commands
- Explained exp_data
- Fixed commands for training and evaluation
- Included examples in the above sections
- Combined above examples with some additional commands to have a brief list of commands to download DCLM-baseline, tokenize shuffle, train, and evaluate a 1B_1x model.