
Add OPT model implementation, OPT model loading from Hugging Face, and OPT training with FSDP #4

Merged: 7 commits merged into main on Aug 24, 2024

Conversation

tigranfah (Member) commented:

Example run output

$ CONFIG_FILE="./train_configs/galactica_125m.toml" ./run_llama_train.sh
+ NGPU=2
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/galactica_125m.toml
+ overrides=
+ '[' 0 -ne 0 ']'
+ torchrun --nproc_per_node=2 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/galactica_125m.toml
W0821 23:26:37.763000 135207170885440 torch/distributed/run.py:779]
W0821 23:26:37.763000 135207170885440 torch/distributed/run.py:779] *****************************************
W0821 23:26:37.763000 135207170885440 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0821 23:26:37.763000 135207170885440 torch/distributed/run.py:779] *****************************************
[rank0]:2024-08-21 23:26:39,933 - root - INFO - Starting job: Galactica debug training
[rank0]:2024-08-21 23:26:39,969 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-08-21 23:26:40,053 - root - INFO - GPU capacity: NVIDIA RTX A6000 (0) with 47.53GiB memory
[rank0]:2024-08-21 23:26:40,053 - root - INFO - Building 1-D device mesh with ['dp'], [2]
[rank0]:2024-08-21 23:26:40,054 - root - INFO - Building 1-D device mesh with ['dp'], [2]
[rank0]:2024-08-21 23:26:40,054 - root - INFO - Building tiktoken tokenizer locally from ./test/assets/test_tiktoken.model
[rank0]:2024-08-21 23:26:40,065 - root - INFO - TikTokenizer built: #words 2256, BOS ID 2000, EOS ID 2001
[rank0]:2024-08-21 23:26:40,065 - root - INFO - Preparing c4_test dataset from test/assets/c4_test
[rank0]:2024-08-21 23:26:40,089 - root - INFO - Building opt 125M with ModelArgs(dim=768, n_layers=12, n_heads=12, n_kv_heads=None, vocab_size=50000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, rope_theta=10000, dropout_p=0.1, max_batch_size=32, max_seq_len=2048, depth_init=True, norm_type='layernorm_bias')
[rank0]:2024-08-21 23:26:43,588 - root - INFO - Model opt 125M size: 163,430,400 total parameters
[rank0]:2024-08-21 23:26:43,588 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-08-21 23:26:43,884 - root - INFO - Applied FSDP to the model
[rank0]:2024-08-21 23:26:43,898 - root - INFO - GPU memory usage for model: 0.63GiB(1.33%)
[rank0]:2024-08-21 23:26:43,898 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240821-2326
[rank0]:2024-08-21 23:26:43,899 - root - INFO - Training starts at step 1, with local batch size 8, global batch size 16, sequence length 2048, total steps 10 (warmup 2)
[rank0]:2024-08-21 23:26:43,899 - root - INFO - Profiling active. Traces will be saved at ./outputs/profile_trace
[rank0]:/auto/home/tigranfahradyan/miniforge3/envs/titan/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
[rank0]:  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[rank0]:2024-08-21 23:26:45,523 - root - INFO - step:  1  loss: 11.8540  memory: 14.85GiB(31.24%)  wps: 10,093  mfu: 3.16%
[rank0]:2024-08-21 23:26:45,523 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:2024-08-21 23:26:46,049 - root - INFO - step:  2  loss: 10.5006  memory: 14.85GiB(31.24%)  wps: 31,181  mfu: 9.76%
[rank0]:2024-08-21 23:26:46,713 - root - INFO - step:  3  loss: 14.6663  memory: 14.85GiB(31.24%)  wps: 24,699  mfu: 7.73%
[rank0]:2024-08-21 23:26:47,330 - root - INFO - step:  4  loss:  9.0398  memory: 14.85GiB(31.24%)  wps: 26,560  mfu: 8.31%
[rank0]:2024-08-21 23:26:48,147 - root - INFO - step:  5  loss:  9.3166  memory: 14.85GiB(31.24%)  wps: 20,074  mfu: 6.28%
[rank0]:2024-08-21 23:26:48,670 - root - INFO - step:  6  loss:  8.3659  memory: 14.85GiB(31.24%)  wps: 31,385  mfu: 9.82%
[rank0]:2024-08-21 23:26:49,406 - root - INFO - step:  7  loss:  7.8526  memory: 14.85GiB(31.24%)  wps: 22,260  mfu: 6.97%
[rank0]:2024-08-21 23:26:50,045 - root - INFO - step:  8  loss:  7.6558  memory: 14.85GiB(31.24%)  wps: 25,685  mfu: 8.04%
[rank0]:2024-08-21 23:26:50,876 - root - INFO - step:  9  loss:  7.4971  memory: 14.85GiB(31.24%)  wps: 19,739  mfu: 6.18%
[rank0]:2024-08-21 23:26:51,430 - root - INFO - step: 10  loss:  7.4064  memory: 14.85GiB(31.24%)  wps: 29,595  mfu: 9.26%
[rank0]:2024-08-21 23:26:51,939 - root - INFO - Dumping traces at step 10
[rank0]:2024-08-21 23:26:51,988 - root - INFO - Finished dumping traces in 0.05 seconds
[rank0]:2024-08-21 23:26:51,998 - root - INFO - Sleeping 2 seconds for other ranks to complete
[rank0]:2024-08-21 23:26:54,000 - root - INFO - Training completed
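
The PR title also covers loading OPT weights from Hugging Face. The run above builds opt 125M locally from the ModelArgs shown in the log; below is a minimal, hypothetical sketch of how pretrained Galactica/OPT weights could be pulled from Hugging Face and mapped onto such a model. The helper name load_galactica_125m, the opt_model_cls handle, and the key-prefix stripping are illustrative assumptions, not the PR's actual utils.py code.

```python
# Hypothetical sketch: load Hugging Face Galactica-125M (an OPT-architecture
# checkpoint) and copy matching tensors into a locally defined OPT model.
# Names and the key mapping below are assumptions for illustration only.
from transformers import OPTForCausalLM


def load_galactica_125m(opt_model_cls, model_args):
    hf_state = OPTForCausalLM.from_pretrained("facebook/galactica-125m").state_dict()

    model = opt_model_cls(model_args)   # e.g. the OPT module added in this PR
    own_state = model.state_dict()

    # Rename HF keys (e.g. "model.decoder.layers.0....") to local module names;
    # the exact mapping depends on how the PR names its submodules.
    mapped = {}
    for hf_key, tensor in hf_state.items():
        local_key = hf_key.replace("model.decoder.", "")  # assumed prefix strip
        if local_key in own_state and own_state[local_key].shape == tensor.shape:
            mapped[local_key] = tensor

    missing, unexpected = model.load_state_dict(mapped, strict=False)
    print(f"loaded {len(mapped)} tensors, {len(missing)} missing, {len(unexpected)} unexpected")
    return model
```

As a sanity check on the log above, 163,430,400 parameters is consistent with a standard OPT-125M layout at vocab_size=50000: roughly 7.1M parameters per decoder layer times 12 layers, plus two 768x50000 embedding/output matrices (assuming an untied output projection), learned positional embeddings, and biases.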

@philippguevorguian (Collaborator) left a comment:

A couple of points to ensure future extensibility, otherwise LGTM

torchtitan/models/opt/__init__.py (review thread resolved)
torchtitan/models/opt/utils.py (review thread resolved)
@@ -19,8 +19,9 @@
 models_parallelize_fns = {
     "llama2": parallelize_llama,
     "llama3": parallelize_llama,
+    'opt': parallelize_llama,
philippguevorguian (Collaborator) commented on this diff:

Could we change the name of parallelize_llama to parallelize_decoder_only? Mapping 'opt' to parallelize_llama looks like a bug even though it isn't.
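
For context, here is a sketch of how the registry could read after such a rename, with parallelize_decoder_only kept as an alias so existing call sites keep working. The name parallelize_decoder_only is the reviewer's suggestion, and the import path is an assumption; neither is code from the merged PR.

```python
# Illustrative sketch of the suggested rename: one decoder-only parallelization
# entry point, registered for every decoder-only model. The import path below
# is assumed for illustration.
from torchtitan.parallelisms import parallelize_llama

# Alias under the clearer name until an actual rename lands upstream.
parallelize_decoder_only = parallelize_llama

models_parallelize_fns = {
    "llama2": parallelize_decoder_only,
    "llama3": parallelize_decoder_only,
    "opt": parallelize_decoder_only,
}
```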

train.py (3 review threads resolved; 1 outdated thread resolved)
philippguevorguian merged commit 21d8e10 into main on Aug 24, 2024
0 of 4 checks passed

2 participants