
PPO slow start #2044

Open
felipemello1 opened this issue Nov 21, 2024 · 0 comments
  1. I tried the https://github.com/pytorch/torchtune/blob/main/recipes/configs/mistral/7B_full_ppo_low_memory.yaml config and gave up after 10 minutes without the first step completing, even after setting compile=False.

  2. We shouldn't be compiling the entire model. Instead, we should use the utility that compiles it layer by layer (see the sketch after this list), like here:

    training.compile_model(model)

  3. DPO distributed doesn't compile the model. Should it?
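
For reference, here is a minimal sketch of what per-layer compilation could look like, assuming a TransformerDecoder-style model whose transformer blocks live in a `model.layers` ModuleList. The helper name `compile_per_layer` is illustrative only; `training.compile_model` above is the actual torchtune utility.

```python
# Illustrative sketch of per-layer compilation, not the exact torchtune
# implementation. Assumes the transformer blocks are stored in model.layers
# as an nn.ModuleList.
import torch
import torch.nn as nn

def compile_per_layer(model: nn.Module) -> nn.Module:
    # Compiling each block separately keeps torch.compile's graph capture
    # small and reusable across identical layers, instead of tracing the
    # whole model in one shot (the likely cause of the slow start above).
    for i, layer in enumerate(model.layers):
        model.layers[i] = torch.compile(layer, backend="inductor")
    return model
```

With this approach the first step should pay roughly the compile cost of a single block (amortized across identical layers) rather than the full-model graph.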

cc: @SalmanMohammadi
