Will it be possible to train the whole model using the 7B LLM? #274
Comments
Hello! You wouldn't be able to take advantage of FSDP (sharding parameters, etc.) since you are only using a single GPU. You could perhaps use FSDP to offload parameters to the CPU, but we don't support this, so you would have to modify the code to do so. If by 'fine-tune the whole model' you mean the cross-attention weights (which are the only trainable parameters for Flamingo), then the entire model will fit on an 80GB GPU :).
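A rough memory estimate helps explain why: with Adam, each trainable parameter costs far more than its weight alone. The sketch below is a back-of-envelope calculation (activations excluded; the ~1.3B trainable-parameter figure for cross-attention-only training is an assumption for illustration, not a number from the repo):

```python
# Back-of-envelope GPU memory estimate for fine-tuning (activations excluded).
# Assumes Adam keeps two fp32 moment buffers per trainable parameter and that
# mixed-precision training keeps an fp32 master copy of trainable weights.

def training_memory_gb(total_params, trainable_params,
                       weight_bytes=2,    # bf16 weights for the whole model
                       master_bytes=4,    # fp32 master copy (trainable only)
                       grad_bytes=2,      # bf16 gradients (trainable only)
                       optim_bytes=8):    # Adam exp_avg + exp_avg_sq in fp32
    total = (total_params * weight_bytes
             + trainable_params * (master_bytes + grad_bytes + optim_bytes))
    return total / 1024**3

# Full fine-tune of a 7B-parameter model: well over 80 GB before activations.
full = training_memory_gb(7e9, 7e9)

# Cross-attention-only training (assuming ~1.3B trainable, for illustration):
partial = training_memory_gb(7e9, 1.3e9)

print(f"full fine-tune: ~{full:.0f} GB, cross-attn only: ~{partial:.0f} GB")
```

Under these assumptions the full fine-tune needs roughly 16 bytes per parameter before activations even enter the picture, which is why multiple GPUs (or CPU offload) are required for the 7B model.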
Thanks for your quick reply. I am sorry that I did not make myself clear. The largest GPU I have is 80GB, and I can use more than one GPU. I am trying to fine-tune all of the parameters (7B), even using … Thanks!
Got it! OK, since you do have multiple GPUs, you should use FSDP. I wouldn't train using pure bf16, as we didn't have success with that; you should go with amp_bf16. If you run into issues with that, then please share the command you are running.
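For context, amp_bf16 here means autocast-style mixed precision: the master weights stay in fp32 and only the forward pass runs in bfloat16, whereas pure bf16 casts the entire model. A minimal PyTorch sketch of the autocast style (a toy linear model, not the open_flamingo training loop):

```python
import torch
import torch.nn as nn

# Minimal sketch contrasting pure bf16 with amp_bf16 (autocast-style):
# under autocast the master weights stay fp32 and only the forward pass
# runs in bfloat16. Toy model for illustration only.
device = "cpu"  # use "cuda" on a GPU that supports bfloat16
model = nn.Linear(16, 4).to(device)          # weights remain torch.float32
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(8, 16, device=device)
target = torch.randn(8, 4, device=device)

with torch.autocast(device_type=device, dtype=torch.bfloat16):
    out = model(x)                           # activations computed in bfloat16
    loss = nn.functional.mse_loss(out.float(), target)

loss.backward()                              # gradients accumulate in fp32
opt.step()
opt.zero_grad()

print(out.dtype, model.weight.dtype)         # activations bf16, weights fp32
```

The key point is that `model.weight.dtype` stays `torch.float32`, which avoids the optimizer-precision problems that pure bf16 training can run into.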
Thanks for your quick reply. As in #232, I met the same dimension-mismatch issue. I also noticed that Gao has started PR #261 to fix it. However, she made a big refactor of the code. I can use her code, but are there smaller modifications if I want to use the … Thanks for your suggestions!
Hi @anas-awadalla, thanks for your quick reply. I am still trying to fine-tune all of the parameters. I tried the 3B model this time because I think it fits my 80GB GPU. I just set …
I followed #253 to comment out the … Really appreciate your patience and help!
Hmm, yeah, this error indicates that there is a trainable component not being used. One thing that comes to mind: maybe you have some samples without images? Also, do you intend for the vision encoder to be unfrozen? Can you print the names of the trainable parameters in the model and share them here?
Hi @anas-awadalla, I really appreciate your help. I set this line and modified the model parameters' gradient settings as below. I also added an index printout to the parameter loop so it tells me each model parameter's index.
I also uncommented the with no_grad() line in the function … Could you please help me? Thanks! The original trainable parameters are as follows.
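Printing the trainable parameters with their indices, as discussed above, can be done with a short loop over named_parameters(). The toy model here is a hypothetical stand-in; with open_flamingo you would pass the model returned by create_model_and_transforms() instead:

```python
import torch.nn as nn

# Hypothetical stand-in model; one parameter is frozen to mimic a partially
# frozen network like Flamingo's.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
model[0].weight.requires_grad = False   # pretend this piece is frozen

def list_trainable(model):
    """Print (index, name, shape) for every trainable parameter."""
    trainable = []
    for i, (name, p) in enumerate(model.named_parameters()):
        if p.requires_grad:
            trainable.append((i, name, tuple(p.shape)))
            print(i, name, tuple(p.shape))
    return trainable

names = [n for _, n, _ in list_trainable(model)]
```

Sharing this list makes it easy to spot a trainable parameter that never appears in the forward pass, which is exactly what the "unused trainable component" error is complaining about.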
This is super useful! I am looking at the vision encoder's forward pass in OpenCLIP. We are using the tokens and not the pooled output, so we may be running into a case where the ln_post parameters are not used in the forward pass. I am still unsure why the cls embedding is not being used; it clearly is here. Which CLIP model are you using? Maybe that has something to do with it. In any case, it might not be harmful at all to freeze the cls embedding as well, although that is a bit hacky.
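Freezing those leftover pieces by name could look like the sketch below. The names visual.ln_post and visual.class_embedding mirror OpenCLIP's ViT attribute naming; the toy module here only mimics that structure for illustration:

```python
import torch
import torch.nn as nn

# Toy module mimicking OpenCLIP's ViT attribute names (illustration only).
class ToyVisual(nn.Module):
    def __init__(self):
        super().__init__()
        self.class_embedding = nn.Parameter(torch.zeros(8))
        self.proj_in = nn.Linear(8, 8)
        self.ln_post = nn.LayerNorm(8)

model = nn.Module()
model.visual = ToyVisual()

# Freeze parameters whose qualified name starts with one of these prefixes.
FROZEN_PREFIXES = ("visual.ln_post", "visual.class_embedding")
for name, p in model.named_parameters():
    if name.startswith(FROZEN_PREFIXES):
        p.requires_grad = False

for name, p in model.named_parameters():
    print(name, p.requires_grad)
```

Freezing by name prefix keeps the fix local: the forward pass is untouched, and the unused parameters simply stop being reported as trainable.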
Sorry that I forgot to reply! Thanks a lot! Very helpful! |
Dear authors,
Thanks for your great work. I wonder whether it is possible to fine-tune the whole model, whose LLM is 7B, using one 80GB GPU with some settings, like FSDP, bf16, etc.
Thanks!