Improved Documentation, Rust Tokenize Shuffle #96

Open
wants to merge 7 commits into main
Conversation

afang-story (Contributor)

- Added rust tokenize shuffle + commands
- Explained exp_data
- Fixed commands for training and evaluation
- Included examples in the above sections

Combined the above examples with some additional commands into a brief list of commands to download DCLM-baseline, tokenize shuffle, train, and evaluate a 1B_1x model.
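
As a rough sketch of that sequence (assuming the Rust tool lives under rust_processing/tokshuf-rs; the download source, the --input/--output/--local-cell-dir flags, and the evaluation step are placeholders rather than the exact README commands, while the remaining flags and the training command are the ones discussed in this PR):

```bash
# Illustrative sketch only; paths, the download source, and the
# --input/--output/--local-cell-dir flags are placeholders.

# 1. Download (a subset of) DCLM-baseline to local disk.
aws s3 cp "$DCLM_BASELINE_S3_PREFIX" ./dclm_baseline_raw/ --recursive

# 2. Tokenize and shuffle with the Rust tool.
cargo run --release --manifest-path rust_processing/tokshuf-rs/Cargo.toml -- \
    --input ./dclm_baseline_raw \
    --local-cell-dir /tmp/tokshuf_cells \
    --output ./dclm_rs_tokshuf \
    --tokenizer "EleutherAI/gpt-neox-20b" \
    --seqlen 2049 \
    --wds-chunk-size 8192 \
    --num-local-cells 512

# 3. Train a 1B_1x model on the shuffled data (command discussed below in this PR).
torchrun --nproc-per-node 8 -m training.train -- \
    --scale 1b_1x_fast \
    --data-config exp_data/datasets/tokenized/dclm_gs3_ls1_rs_tokshuf.json \
    --logs dclm_rs_tokshuf_training_local_logs \
    --attn-name torch_attn \
    --torchcompile

# 4. Evaluation: see the eval section of the README (command omitted here).
```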

README.md
### Training
The data-config comes from the JSON created after tokenize shuffle (manually for the Rust code, automatically for Ray).
Contributor:

Is it possible to incorporate the manifest creation in the above?

afang-story (Contributor, Author), Nov 26, 2024:

We would have to write a separate python script and have a bash script that runs the two together.
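
For illustration, such a wrapper might look like the sketch below; the manifest script name, its flags, and the --input/--output flags are hypothetical placeholders (the remaining tokenize-shuffle flags are the ones from the README snippet above), not something this PR adds:

```bash
#!/bin/bash
# Hypothetical wrapper combining the two steps described above.
# Script names, paths, and flags other than those shown in this PR are placeholders.
set -euo pipefail

INPUT_DIR=$1    # raw jsonl shards
OUTPUT_DIR=$2   # tokenized + shuffled webdataset shards

# Step 1: Rust tokenize shuffle (same flags as in the README snippet above).
cargo run --release --manifest-path rust_processing/tokshuf-rs/Cargo.toml -- \
    --input "$INPUT_DIR" \
    --output "$OUTPUT_DIR" \
    --tokenizer "EleutherAI/gpt-neox-20b" \
    --seqlen 2049 \
    --wds-chunk-size 8192 \
    --num-local-cells 512

# Step 2: separate Python script that writes the manifest for the output shards
# (placeholder name; this is the script that would still need to be written).
python make_wds_manifest.py --data-dir "$OUTPUT_DIR"
```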

--tokenizer "EleutherAI/gpt-neox-20b" \
--seqlen 2049 \
--wds-chunk-size 8192 \
--num-local-cells 512
Contributor:

Should --num-local-cells be equal to the number of available cores (in which case we should probably decrease the default / add a comment)?

afang-story (Contributor, Author), Nov 26, 2024:

No. There is an explanation for this in the rust tokenize shuffle readme, but this default is reasonable.

README.md Outdated
### Training
The data-config comes from the JSON created after tokenize shuffle (manually for the Rust code, automatically for Ray).
```bash
torchrun --nproc-per-node 8 -m training.train --scale 1b_1x_fast --data-config exp_data/datasets/tokenized/dclm_gs3_ls1_rs_tokshuf.json --logs dclm_rs_tokshuf_training_local_logs --torchcompile
```
GeorgiosSmyrnis (Contributor), Nov 20, 2024:

We should change this to `torchrun --nproc-per-node 8 -m training.train -- --scale 1b_1x_fast --data-config exp_data/datasets/tokenized/dclm_gs3_ls1_rs_tokshuf.json --logs dclm_rs_tokshuf_training_local_logs --attn-name torch_attn --torchcompile`, because newer versions of torchrun seem to have an issue with the --logs option (adding -- makes it clear that this is an option to training.train, not to torchrun).

edit: added --attn-name torch_attn as well

afang-story (Contributor, Author):

Added.
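
For reference, the command as suggested above, reformatted with one flag per line, would read:

```bash
# Suggested training command from the review comment above.
# Note the lone "--" so that the options after it go to training.train, not torchrun.
torchrun --nproc-per-node 8 -m training.train -- \
    --scale 1b_1x_fast \
    --data-config exp_data/datasets/tokenized/dclm_gs3_ls1_rs_tokshuf.json \
    --logs dclm_rs_tokshuf_training_local_logs \
    --attn-name torch_attn \
    --torchcompile
```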
