
Add OPT model implementation, OPT model loading from Hugging Face, and OPT training with FSDP #4

Merged: 7 commits merged into main on Aug 24, 2024

Conversation

tigranfah (Member) commented:

Example run output

$ CONFIG_FILE="./train_configs/galactica_125m.toml" ./run_llama_train.sh
+ NGPU=2
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/galactica_125m.toml
+ overrides=
+ '[' 0 -ne 0 ']'
+ torchrun --nproc_per_node=2 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/galactica_125m.toml
W0821 23:26:37.763000 135207170885440 torch/distributed/run.py:779]
W0821 23:26:37.763000 135207170885440 torch/distributed/run.py:779] *****************************************
W0821 23:26:37.763000 135207170885440 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0821 23:26:37.763000 135207170885440 torch/distributed/run.py:779] *****************************************
[rank0]:2024-08-21 23:26:39,933 - root - INFO - Starting job: Galactica debug training
[rank0]:2024-08-21 23:26:39,969 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-08-21 23:26:40,053 - root - INFO - GPU capacity: NVIDIA RTX A6000 (0) with 47.53GiB memory
[rank0]:2024-08-21 23:26:40,053 - root - INFO - Building 1-D device mesh with ['dp'], [2]
[rank0]:2024-08-21 23:26:40,054 - root - INFO - Building 1-D device mesh with ['dp'], [2]
[rank0]:2024-08-21 23:26:40,054 - root - INFO - Building tiktoken tokenizer locally from ./test/assets/test_tiktoken.model
[rank0]:2024-08-21 23:26:40,065 - root - INFO - TikTokenizer built: #words 2256, BOS ID 2000, EOS ID 2001
[rank0]:2024-08-21 23:26:40,065 - root - INFO - Preparing c4_test dataset from test/assets/c4_test
[rank0]:2024-08-21 23:26:40,089 - root - INFO - Building opt 125M with ModelArgs(dim=768, n_layers=12, n_heads=12, n_kv_heads=None, vocab_size=50000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, rope_theta=10000, dropout_p=0.1, max_batch_size=32, max_seq_len=2048, depth_init=True, norm_type='layernorm_bias')
[rank0]:2024-08-21 23:26:43,588 - root - INFO - Model opt 125M size: 163,430,400 total parameters
[rank0]:2024-08-21 23:26:43,588 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-08-21 23:26:43,884 - root - INFO - Applied FSDP to the model
[rank0]:2024-08-21 23:26:43,898 - root - INFO - GPU memory usage for model: 0.63GiB(1.33%)
[rank0]:2024-08-21 23:26:43,898 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240821-2326
[rank0]:2024-08-21 23:26:43,899 - root - INFO - Training starts at step 1, with local batch size 8, global batch size 16, sequence length 2048, total steps 10 (warmup 2)
[rank0]:2024-08-21 23:26:43,899 - root - INFO - Profiling active. Traces will be saved at ./outputs/profile_trace
[rank0]:/auto/home/tigranfahradyan/miniforge3/envs/titan/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
[rank0]:  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[rank0]:2024-08-21 23:26:45,523 - root - INFO - step:  1  loss: 11.8540  memory: 14.85GiB(31.24%)  wps: 10,093  mfu: 3.16%
[rank0]:2024-08-21 23:26:45,523 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:2024-08-21 23:26:46,049 - root - INFO - step:  2  loss: 10.5006  memory: 14.85GiB(31.24%)  wps: 31,181  mfu: 9.76%
[rank0]:2024-08-21 23:26:46,713 - root - INFO - step:  3  loss: 14.6663  memory: 14.85GiB(31.24%)  wps: 24,699  mfu: 7.73%
[rank0]:2024-08-21 23:26:47,330 - root - INFO - step:  4  loss:  9.0398  memory: 14.85GiB(31.24%)  wps: 26,560  mfu: 8.31%
[rank0]:2024-08-21 23:26:48,147 - root - INFO - step:  5  loss:  9.3166  memory: 14.85GiB(31.24%)  wps: 20,074  mfu: 6.28%
[rank0]:2024-08-21 23:26:48,670 - root - INFO - step:  6  loss:  8.3659  memory: 14.85GiB(31.24%)  wps: 31,385  mfu: 9.82%
[rank0]:2024-08-21 23:26:49,406 - root - INFO - step:  7  loss:  7.8526  memory: 14.85GiB(31.24%)  wps: 22,260  mfu: 6.97%
[rank0]:2024-08-21 23:26:50,045 - root - INFO - step:  8  loss:  7.6558  memory: 14.85GiB(31.24%)  wps: 25,685  mfu: 8.04%
[rank0]:2024-08-21 23:26:50,876 - root - INFO - step:  9  loss:  7.4971  memory: 14.85GiB(31.24%)  wps: 19,739  mfu: 6.18%
[rank0]:2024-08-21 23:26:51,430 - root - INFO - step: 10  loss:  7.4064  memory: 14.85GiB(31.24%)  wps: 29,595  mfu: 9.26%
[rank0]:2024-08-21 23:26:51,939 - root - INFO - Dumping traces at step 10
[rank0]:2024-08-21 23:26:51,988 - root - INFO - Finished dumping traces in 0.05 seconds
[rank0]:2024-08-21 23:26:51,998 - root - INFO - Sleeping 2 seconds for other ranks to complete
[rank0]:2024-08-21 23:26:54,000 - root - INFO - Training completed
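
The PR title also covers loading OPT weights from Hugging Face. The run above builds opt 125M locally from the ModelArgs shown in the log; below is a minimal, hypothetical sketch of how pretrained Galactica/OPT weights could be pulled from Hugging Face and mapped onto such a model. The helper name load_galactica_125m, the opt_model_cls handle, and the key-prefix stripping are illustrative assumptions, not the PR's actual utils.py code.

```python
# Hypothetical sketch: load Hugging Face Galactica-125M (an OPT-architecture
# checkpoint) and copy matching tensors into a locally defined OPT model.
# Names and the key mapping below are assumptions for illustration only.
from transformers import OPTForCausalLM


def load_galactica_125m(opt_model_cls, model_args):
    hf_state = OPTForCausalLM.from_pretrained("facebook/galactica-125m").state_dict()

    model = opt_model_cls(model_args)   # e.g. the OPT module added in this PR
    own_state = model.state_dict()

    # Rename HF keys (e.g. "model.decoder.layers.0....") to local module names;
    # the exact mapping depends on how the PR names its submodules.
    mapped = {}
    for hf_key, tensor in hf_state.items():
        local_key = hf_key.replace("model.decoder.", "")  # assumed prefix strip
        if local_key in own_state and own_state[local_key].shape == tensor.shape:
            mapped[local_key] = tensor

    missing, unexpected = model.load_state_dict(mapped, strict=False)
    print(f"loaded {len(mapped)} tensors, {len(missing)} missing, {len(unexpected)} unexpected")
    return model
```

As a sanity check on the log above, 163,430,400 parameters is consistent with a standard OPT-125M layout at vocab_size=50000: roughly 7.1M parameters per decoder layer times 12 layers, plus two 768x50000 embedding/output matrices (assuming an untied output projection), learned positional embeddings, and biases.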

@philippguevorguian (Collaborator) left a comment:

A couple of points to ensure future extensibility, otherwise LGTM

torchtitan/models/opt/__init__.py (review thread resolved)
torchtitan/models/opt/utils.py (review thread resolved)
@@ -19,8 +19,9 @@
 models_parallelize_fns = {
     "llama2": parallelize_llama,
     "llama3": parallelize_llama,
+    'opt': parallelize_llama,
philippguevorguian (Collaborator) commented on this diff:

Could we change the name of parallelize_llama to parallelize_decoder_only? Mapping 'opt' to parallelize_llama looks like a bug even though it isn't.
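
For context, here is a sketch of how the registry could read after such a rename, with parallelize_decoder_only kept as an alias so existing call sites keep working. The name parallelize_decoder_only is the reviewer's suggestion, and the import path is an assumption; neither is code from the merged PR.

```python
# Illustrative sketch of the suggested rename: one decoder-only parallelization
# entry point, registered for every decoder-only model. The import path below
# is assumed for illustration.
from torchtitan.parallelisms import parallelize_llama

# Alias under the clearer name until an actual rename lands upstream.
parallelize_decoder_only = parallelize_llama

models_parallelize_fns = {
    "llama2": parallelize_decoder_only,
    "llama3": parallelize_decoder_only,
    "opt": parallelize_decoder_only,
}
```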

train.py (3 review threads resolved; 1 outdated thread resolved)
philippguevorguian merged commit 21d8e10 into main on Aug 24, 2024
0 of 4 checks passed

2 participants