Improved Documentation, Rust Tokenize Shuffle #96

Open
wants to merge 7 commits into main
Conversation

afang-story (Contributor)

- Added rust tokenize shuffle + commands
- Explained exp_data
- Fixed commands for training and evaluation
- Included examples in the above sections

Combined the above examples with some additional commands into a brief list of commands to download DCLM-baseline, tokenize shuffle, train, and evaluate a 1B_1x model.
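
As a rough sketch of that sequence (assuming the Rust tool lives under rust_processing/tokshuf-rs; the download source, the --input/--output/--local-cell-dir flags, and the evaluation step are placeholders rather than the exact README commands, while the remaining flags and the training command are the ones discussed in this PR):

```bash
# Illustrative sketch only; paths, the download source, and the
# --input/--output/--local-cell-dir flags are placeholders.

# 1. Download (a subset of) DCLM-baseline to local disk.
aws s3 cp "$DCLM_BASELINE_S3_PREFIX" ./dclm_baseline_raw/ --recursive

# 2. Tokenize and shuffle with the Rust tool.
cargo run --release --manifest-path rust_processing/tokshuf-rs/Cargo.toml -- \
    --input ./dclm_baseline_raw \
    --local-cell-dir /tmp/tokshuf_cells \
    --output ./dclm_rs_tokshuf \
    --tokenizer "EleutherAI/gpt-neox-20b" \
    --seqlen 2049 \
    --wds-chunk-size 8192 \
    --num-local-cells 512

# 3. Train a 1B_1x model on the shuffled data (command discussed below in this PR).
torchrun --nproc-per-node 8 -m training.train -- \
    --scale 1b_1x_fast \
    --data-config exp_data/datasets/tokenized/dclm_gs3_ls1_rs_tokshuf.json \
    --logs dclm_rs_tokshuf_training_local_logs \
    --attn-name torch_attn \
    --torchcompile

# 4. Evaluation: see the eval section of the README (command omitted here).
```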

README.md
### Training
The data-config comes from the JSON created after tokenize shuffle (manually for the Rust code, automatically for Ray).
Contributor:

Is it possible to incorporate the manifest creation in the above?

afang-story (Contributor, Author), Nov 26, 2024:

We would have to write a separate python script and have a bash script that runs the two together.
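
For illustration, such a wrapper might look like the sketch below; the manifest script name, its flags, and the --input/--output flags are hypothetical placeholders (the remaining tokenize-shuffle flags are the ones from the README snippet above), not something this PR adds:

```bash
#!/bin/bash
# Hypothetical wrapper combining the two steps described above.
# Script names, paths, and flags other than those shown in this PR are placeholders.
set -euo pipefail

INPUT_DIR=$1    # raw jsonl shards
OUTPUT_DIR=$2   # tokenized + shuffled webdataset shards

# Step 1: Rust tokenize shuffle (same flags as in the README snippet above).
cargo run --release --manifest-path rust_processing/tokshuf-rs/Cargo.toml -- \
    --input "$INPUT_DIR" \
    --output "$OUTPUT_DIR" \
    --tokenizer "EleutherAI/gpt-neox-20b" \
    --seqlen 2049 \
    --wds-chunk-size 8192 \
    --num-local-cells 512

# Step 2: separate Python script that writes the manifest for the output shards
# (placeholder name; this is the script that would still need to be written).
python make_wds_manifest.py --data-dir "$OUTPUT_DIR"
```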

--tokenizer "EleutherAI/gpt-neox-20b" \
--seqlen 2049 \
--wds-chunk-size 8192 \
--num-local-cells 512
Contributor:

Should --num-local-cells be equal to the number of available cores (in which case we should probably decrease the default / add a comment)?

afang-story (Contributor, Author), Nov 26, 2024:

No. There is an explanation for this in the rust tokenize shuffle readme, but this default is reasonable.

README.md Outdated
### Training
The data-config comes from the JSON created after tokenize shuffle (manually for the Rust code, automatically for Ray).
```bash
torchrun --nproc-per-node 8 -m training.train --scale 1b_1x_fast --data-config exp_data/datasets/tokenized/dclm_gs3_ls1_rs_tokshuf.json --logs dclm_rs_tokshuf_training_local_logs --torchcompile
```
GeorgiosSmyrnis (Contributor), Nov 20, 2024:

We should change this to `torchrun --nproc-per-node 8 -m training.train -- --scale 1b_1x_fast --data-config exp_data/datasets/tokenized/dclm_gs3_ls1_rs_tokshuf.json --logs dclm_rs_tokshuf_training_local_logs --attn-name torch_attn --torchcompile`, because newer versions of torchrun seem to have an issue with the --logs option (adding -- makes it clear that this is an option to training.train, not to torchrun).

edit: added --attn-name torch_attn as well

afang-story (Contributor, Author):

Added.
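
For reference, the command as suggested above, reformatted with one flag per line, would read:

```bash
# Suggested training command from the review comment above.
# Note the lone "--" so that the options after it go to training.train, not torchrun.
torchrun --nproc-per-node 8 -m training.train -- \
    --scale 1b_1x_fast \
    --data-config exp_data/datasets/tokenized/dclm_gs3_ls1_rs_tokshuf.json \
    --logs dclm_rs_tokshuf_training_local_logs \
    --attn-name torch_attn \
    --torchcompile
```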
