a bug found in save_model of LOMOTrainer #54

I used LOMO (with ZeRO-3) to fine-tune chatglm2-6b on 8 NVIDIA 3090 GPUs and saved the model with LOMOTrainer's save_model method. After reloading the checkpoint, I found that the validation loss measured with the model differed from the validation loss measured at the end of training. Referring to DeepSpeed's official model-saving code, I rewrote save_model (rewritten code below) and found that this resolved the bug. This indicates that the original version of save_model has a bug, but I have not yet figured out the specific cause of the error.
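For reference, the DeepSpeed pattern such a rewrite follows is to reassemble the full ZeRO-3 parameters before serializing, since under stage 3 each rank holds only a shard. The sketch below is illustrative rather than the exact code from this issue; it assumes `model` is the DeepSpeed engine (as in LOMOTrainer) and `output_path` is a file path.

```python
import deepspeed
import torch
import torch.distributed as dist

def save_model(model, output_path):
    """Gather ZeRO-3 partitioned weights and save a full state_dict from rank 0."""
    state_dict = {}
    for name, param in model.module.named_parameters():
        # Under ZeRO-3 each rank holds only a shard of `param`; GatheredParameters
        # temporarily reassembles the full tensor (all ranks must enter the context).
        with deepspeed.zero.GatheredParameters(param):
            if dist.get_rank() == 0:
                state_dict[name] = param.detach().cpu().clone()
    if dist.get_rank() == 0:
        # Buffers are not partitioned by ZeRO-3, so they can be copied directly.
        for name, buf in model.module.named_buffers():
            state_dict[name] = buf.detach().cpu().clone()
        torch.save(state_dict, output_path)
```

Gathering one parameter at a time keeps peak GPU memory at a single full tensor instead of the whole unpartitioned model.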
Comments

Thanks for your kind feedback and …
I am honored that my suggestion was adopted. I would also like to ask what you think about the earlier …
Hi, the reason I want to know the answer to this question is that I noticed the optimizer implementation of LOMO and the code of …
@DingQiang2018 Hi, I noticed that the author modified LOMOTrainer and LOMOLoRaTrainer following your suggestion. LOMOTrainer runs fine, but LOMOLoRaTrainer raises an error at self.model.optimizer.partition_all_parameters(). Have you run into the same problem? Thanks!
Yeah, I am having this issue too. Did you find any solution?
I also cannot get the same results after merging LLaMA with LoRA, which is strange.
Hi, because lomo_lora_trainer has an additional optimizer for LoRA, DeepSpeedZeRoOffload cannot be invoked through model.optimizer. For now I have reverted save_model() in lomo_lora_trainer.py to the previous version.
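For anyone hitting the same error: a workaround that avoids model.optimizer entirely is DeepSpeed's own checkpoint path plus offline consolidation. A sketch, assuming `engine` is the DeepSpeed engine and the paths are placeholders:

```python
import torch
import torch.distributed as dist
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

def save_consolidated(engine, ckpt_dir, out_file, tag="final"):
    # save_checkpoint() writes the ZeRO-partitioned shards from every rank and
    # also drops a zero_to_fp32.py helper script into the checkpoint directory.
    engine.save_checkpoint(ckpt_dir, tag=tag)
    dist.barrier()  # make sure all ranks have finished writing their shards
    if dist.get_rank() == 0:
        # Consolidate the shards into one fp32 state_dict and save it as a
        # plain PyTorch checkpoint; this step can also be run offline later.
        state_dict = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir, tag=tag)
        torch.save(state_dict, out_file)
```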
Hi, I would like to know how much the loss on ChatGLM2 differs between the two saving methods. Do you still have a record of that? BTW, would LLaMA have the same problem?
Hi, I noticed that, but I still cannot get the merged model to produce the same eval results…
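For comparison, the usual LoRA merge path with PEFT is sketched below; the paths are placeholders and this assumes the adapter was saved with save_pretrained. One common source of small eval differences after merging is precision, since folding the deltas into fp16 base weights rounds differently than applying the adapter on the fly.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/base_llama")  # placeholder path
model = PeftModel.from_pretrained(base, "path/to/lora_adapter")    # placeholder path

# merge_and_unload() folds the LoRA update (scaled B @ A) into the base
# weights and returns a plain transformers model without PEFT wrappers.
merged = model.merge_and_unload()
merged.save_pretrained("path/to/merged_model")
```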