Mistral Support #4
Hi,

Great work on this! Is Mistral supported? Right now I only see GPT-J and Llama 2.

Thank you!

Comments
We currently don't support Mistral, but please check again in mid-January for the new release. I would assume it should be straightforward to replace the LLM calls with Mistral AI, since we use the standard Hugging Face interface. So, one would replace the Llama LLM calls here and here with the Mistral code, which is documented here: https://huggingface.co/docs/transformers/main/en/model_doc/mistral (see the sketch below). That said, sometimes the code requires small changes elsewhere. E.g., the gold answer for GPT-J often has a space at the beginning (i.e., the correct token is " London" instead of "London"), whereas for Llama 2 it doesn't. The refactoring we are working on will take care of all these small changes, and we aim to release it as soon as possible. If your need is more urgent, let us know how we can provide more specific guidance. And we would welcome any pull request that you want to provide.
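A minimal sketch of what that swap might look like, assuming the standard `transformers` causal-LM interface; the checkpoint name and dtype below are assumptions rather than the repo's actual code:

```python
# Hedged sketch: loading Mistral through the standard Hugging Face interface,
# in place of the Llama calls referenced above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()
```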
Thank you!
I could provide pull requests, because there are small typos (e.g., lname: {best_lnum}), but I thought it best to wait for the refactoring first. :) BTW, yesterday I applied LASER to a 34B Yi model. To my knowledge, this is the first model on Hugging Face modified with your method. My intention is to target other layers as well (basically, apply LASER to multiple layers).
@Mihaiii, this is exciting and we are looking forward to your PR. It may be a good idea to wait for the refactoring; we are working to release it very quickly, and I hope to get it done by Jan 10th-15th.
We also plan to release a leaderboard-style table on the website. We can add the numbers from your study and cite your HF release once it is up, if you'd like. Thanks again.
Nice! Quick question: how much compute did you need to create the Yi version?
It's proportional to how many interventions you are willing to try before settling on the best result. I used an A100 for more than 6 hours of experiments. I'd estimate a little under half an hour per experiment, partly because the Yi model loads extremely slowly and I had to reload it for each experiment due to memory constraints.
Of course I'd like that. Thank you! :)
Sounds good. @Mihaiii, one thing we tried on many datasets is to focus only on the last 20-30% of the layers and only on the MLP parameters, and to try very aggressive reductions. We often used a rate (in our code) of 9.9 or 9.95. In terms of ρ, this means keeping only a 0.01 or 0.005 fraction of the max rank (max{m, n}). I'd be curious to see what kind of gains these reductions give. E.g., see the for loop here: https://github.com/pratyushasharma/laser/blob/main/src/intervention_llama2_bbh.py#L326 (a rough sketch of the reduction step is below).
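For concreteness, here is a hedged sketch of such a rank reduction; it is not the repo's implementation, and the only assumption is the relation stated above, that a rate of 9.9 corresponds to keeping a 1 - 9.9/10 = 0.01 fraction of max(m, n):

```python
import torch

def low_rank_approx(weight: torch.Tensor, rate: float) -> torch.Tensor:
    """Return a rank-reduced copy of an (m x n) weight matrix via truncated SVD.

    Assumed convention from the discussion above:
    k = (1 - rate / 10) * max(m, n), so rate=9.9 keeps 1% of the max rank.
    """
    m, n = weight.shape
    k = max(1, int((1.0 - rate / 10.0) * max(m, n)))
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    return (U[:, :k] * S[:k]) @ Vh[:k, :]

# Example: replace the second MLP weight of a late layer with its reduced version.
# layer.mlp.down_proj.weight.data = low_rank_approx(layer.mlp.down_proj.weight.data, 9.9)
```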
I applied 6 interventions on 6 different layers, and the end result of that experiment was not very good, but I messed with some attn layers and I think that was a mistake (I should have targeted only MLP layers). I'm waiting for your refactoring and the changes mentioned here, and then I'll retry.
Sounds good. Yes, I would not target attention at all, and to begin with I would target only a single layer. In our code, --lname fc_in means the first layer of the MLP and --lname fc_out means the second layer of the MLP. We do have an experiment with stacking in our paper, and we do it via greedy selection (a rough sketch follows below). If you want, I can take a quick look at your Python code, but no pressure at all. I am trying to get the refactoring done soon.
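A hedged sketch of that greedy stacking idea; `evaluate_with_interventions` and `candidate_layers` are placeholder names for illustration, not functions from the repo:

```python
def greedy_stack(candidate_layers, evaluate_with_interventions, rate=9.9):
    """Greedily add one (layer, lname, rate) intervention at a time,
    keeping each new intervention only if validation accuracy improves."""
    chosen = []
    best_acc = evaluate_with_interventions(chosen)  # baseline: no interventions
    for layer in candidate_layers:                  # e.g. the last 20-30% of layers
        trial = chosen + [(layer, "fc_in", rate)]
        acc = evaluate_with_interventions(trial)
        if acc > best_acc:
            chosen, best_acc = trial, acc
    return chosen, best_acc
```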
Related to #9 |
Thank you, but I don't have anything other than the branch I mentioned in the README of the models I pushed for that experiment. In any case, I made a new model where I didn't target the attn layers: https://huggingface.co/Mihaiii/Pallas-0.5-LASER-exp2-0.1 and it looks OK after quick testing. Here I chose the target layers directly, based on previous experiments. I tried to frankenmerge this new model using mergekit, but I never got good results, no matter which merge config I tested (I tried stacking the layers I applied LASER to - meaning 56, 54, 51 - and I also tried stacking the layers in between the layers I applied LASER to). I thought it would be interesting to frankenmerge a model that had a LASER intervention, but it seems like one of those situations where I overcomplicated things and ruined something that was working :).
Okay. Let me take a look at your repo later today. I can also try to run these experiments on my end later this week. I will put a leaderboard up later today, where we can collate all the results. We did have to do a fairly exhaustive hyperparameter search to find the optimal validation accuracy. I will also share the results for each hyperparameter in our search space, in case that helps people find good hyperparameters. Finding good hyperparameters is very much an open question at the moment, and we are learning more as we try more experiments.
@Mihaiii, I am looking at your code here. Did you implement the LASER reductions for the LLM you are using? The existing code only works for Llama/GPT-J and RoBERTa; you need to implement the mappings (see here, and the illustrative sketch below). I would double-check whether the reductions are even happening; I didn't go through all the changes you may have made. Also, based on our experience, intervening on the attention matrices typically hurts performance. I would also start by reducing a single weight matrix (either the first or the second layer of the MLP), and in the later sections of the model.
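Purely as an illustration of the kind of mapping meant above, here is a hypothetical lname-to-module-path table for a Llama-style model; the keys mirror the --lname values mentioned earlier, but the dictionary itself is an assumption, not the repo's actual code:

```python
# Hypothetical mapping from lname to Hugging Face module paths for a
# Llama-family model (fc_in / fc_out = first / second MLP layer, per the
# discussion above). The exact structure the repo expects may differ.
LLAMA_STYLE_MAPPING = {
    "fc_in": "model.layers.{layer}.mlp.up_proj",
    "fc_out": "model.layers.{layer}.mlp.down_proj",
    "q_proj": "model.layers.{layer}.self_attn.q_proj",
    "k_proj": "model.layers.{layer}.self_attn.k_proj",
    "v_proj": "model.layers.{layer}.self_attn.v_proj",
    "out_proj": "model.layers.{layer}.self_attn.o_proj",
}
```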
@dkmisra Thanks for your answer. In my testing I also saw performance issues at inference time - sampling was taking longer (note that I convert everything to llama.cpp's Q8 GGUF and use that for inference). I'm pretty convinced the intervention happens, since I get considerably different results (i.e., in what is logged) on the dataset I'm using (BIG-bench, subset causal_judgement) with different inputs (by inputs I mean what I change directly in the code, not command-line args, since those args are overwritten in the code). I'm using the default Llama architecture mappings, since Yi uses the Llama architecture, so everything is supported. In any case, the code in my branch doesn't add anything new or especially useful - I'll use your standard code when it's available. Thank you! :)