Mistral Support #4

Open
fakerybakery opened this issue Dec 31, 2023 · 16 comments
Labels: enhancement (New feature or request)

@fakerybakery commented Dec 31, 2023

Hi,
Great work on this! Is Mistral supported? Right now I only see GPT-J and Llama 2.
Thank you!

@dkmisra (Collaborator) commented Dec 31, 2023

We currently don't support Mistral, but please check again in mid-January for the new release. I would assume it should be straightforward to replace the LLM calls with Mistral, as we use the standard Hugging Face interface. So, one would replace the Llama LLM calls here and here with the Mistral code:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

which are mentioned here: https://huggingface.co/docs/transformers/main/en/model_doc/mistral

That said, sometimes the code requires some small changes elsewhere. E.g., the gold answer for GPT-J often uses a space at the beginning (i.e., the correct token is " London" instead of "London"), whereas for Llama 2 it doesn't.
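As an illustration (a minimal sketch, not from the repo; the checkpoint names are just the public Hugging Face IDs), the two tokenizers handle the leading space differently:

```python
from transformers import AutoTokenizer

# GPT-J uses byte-level BPE: " London" and "London" map to different tokens,
# so the gold answer usually needs the leading space.
gptj_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
print(gptj_tok.tokenize(" London"))
print(gptj_tok.tokenize("London"))

# Llama 2 uses SentencePiece, which adds the word-start marker itself,
# so the gold answer is written without a leading space.
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(llama_tok.tokenize("London"))
```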

The refactoring that we are working on will take care of all these small changes, and we aim to release it as soon as possible. If your need is more urgent, let us know and we can provide more specific guidance. And we would welcome any pull request that you want to provide.

dkmisra assigned dkmisra and pratyushasharma and unassigned dkmisra on Dec 31, 2023
@fakerybakery (Author)

Thank you!

@Mihaiii commented Dec 31, 2023

> And we would welcome any pull request that you want to provide.

I could provide pull requests because there are small typos (e.g., lname: {best_lnum}), but I thought it best to first wait for the refactoring. :)

BTW, yesterday I applied LASER to a 34B Yi model. To my knowledge, this is the first model on Hugging Face that has been modified with your method. My intention is to further target other layers (basically, apply LASER to multiple layers).

https://huggingface.co/Mihaiii/Pallas-0.5-LASER-0.1

@dkmisra (Collaborator) commented Dec 31, 2023

@Mihaiii, this is exciting and we are looking forward to your PR. It may be a good idea to wait for the refactoring; we are working to release it very quickly. I would hope to get it done by Jan 10th-15th.

@dkmisra (Collaborator) commented Dec 31, 2023

We also plan to release a leaderboard-style table on the website. We can add the numbers from your study and cite your HF release once it is up, if you want. Thanks again.

@fakerybakery (Author)

> And we would welcome any pull request that you want to provide.
>
> I could provide pull requests because there are small typos (e.g., lname: {best_lnum}), but I thought it best to first wait for the refactoring. :)
>
> BTW, yesterday I applied LASER to a 34B Yi model. To my knowledge, this is the first model on Hugging Face that has been modified with your method. My intention is to further target other layers (basically, apply LASER to multiple layers).
>
> huggingface.co/Mihaiii/Pallas-0.5-LASER-0.1

Nice! Quick question - how much compute did you need to create the Yi version?

@Mihaiii commented Dec 31, 2023

> Nice! Quick question - how much compute did you need to create the Yi version?

It's proportional to how many interventions you are willing to try out before settling on the best result. I used an A100 for more than 6 hours of experiments. I'd estimate it takes a little less than half an hour per experiment; part of that is the Yi model loading extremely slowly, and I had to reload it for each experiment due to memory constraints.

@Mihaiii commented Dec 31, 2023

> We can add the numbers from your study and cite your HF release once it is up, if you want.

Of course I'd want that. Thank you! :)
But that model is intermediate work - I'll apply LASER to multiple layers of the same single model instance.

@dkmisra (Collaborator) commented Jan 3, 2024

Sounds good. @Mihaiii, one thing we tried across many datasets is to focus only on the last 20-30% of the layers, and only the MLP parameters, and try very aggressive reductions. We often set rate (in our code) to 9.9 and 9.95. In terms of ρ, this means keeping only a 0.01 and 0.005 fraction of the max rank (max{m, n}). I'd be curious to see what kind of gains these reductions produce.

E.g., see the for loop here: https://github.com/pratyushasharma/laser/blob/main/src/intervention_llama2_bbh.py#L326
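For intuition, here is a plain-torch sketch (not the repo's actual helper; the rate-to-fraction mapping is taken from the comment above) of what such a reduction does to a single weight matrix:

```python
import torch

def low_rank_approx(weight: torch.Tensor, rate: float) -> torch.Tensor:
    """Rank-k SVD approximation of `weight` (m x n), where
    k = ((10 - rate) / 10) * max(m, n); rate 9.9 keeps ~1% of the max rank."""
    keep_frac = (10.0 - rate) / 10.0
    k = max(1, int(keep_frac * max(weight.shape)))
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    return (U[:, :k] * S[:k]) @ Vh[:k, :]

# e.g., applied to the second MLP weight of a late GPT-J block (illustrative only):
# block.mlp.fc_out.weight.data = low_rank_approx(block.mlp.fc_out.weight.data, 9.9)
```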

@Mihaiii commented Jan 3, 2024

I applied 6 interventions on 6 different layers and the end result of that experiment is not very good, but I messed with some attn layers and I think that was a mistake (I should have always targeted only the mlp layers).

I'm waiting for your refactoring and the changes mentioned here, and then I'll retry.

@dkmisra (Collaborator) commented Jan 3, 2024

Sounds good. Yes, I would not target attention at all, and to begin with, I would start by targeting only a single layer. In our code, --lname fc_in means the first layer of the MLP and --lname fc_out means the second layer of the MLP. We do have an experiment with stacking in our paper, and we do it via greedy selection (a sketch of that loop is included below).

If you want, I can take a quick look at your Python code, but no pressure at all. I am trying to get the refactoring done soon.
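A hedged sketch of that greedy stacking loop (apply_reduction and evaluate are hypothetical helpers, not functions from this repo):

```python
def greedy_laser(base_model, candidates, apply_reduction, evaluate):
    """candidates: list of (layer, lname, rate) triples to consider.
    apply_reduction: hypothetical helper returning a modified copy of the model.
    evaluate: hypothetical helper returning validation accuracy."""
    model, applied = base_model, []
    best_acc = evaluate(model)
    while True:
        round_best = None
        for layer, lname, rate in candidates:
            trial = apply_reduction(model, layer, lname, rate)
            acc = evaluate(trial)
            if acc > best_acc:
                best_acc, round_best = acc, (trial, (layer, lname, rate))
        if round_best is None:       # no candidate improved validation accuracy
            return model, applied
        model, choice = round_best   # keep the best reduction and stack the next one on top of it
        applied.append(choice)
```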

@dkmisra (Collaborator) commented Jan 4, 2024

Related to #9

@Mihaiii commented Jan 4, 2024

> If you want, I can take a quick look at your Python code, but no pressure at all. I am trying to get the refactoring done soon.

Thank you, but I don't have anything other than the branch I mentioned in the readme of the models I pushed for that experiment.
Basically, I followed the same greedy approach, but I always chose between attn and mlp. On some layers I included rates of 4 and 6 when searching for the best results for that layer; on others I didn't. The code from my branch is pretty chaotic.

In any case, I made a new model where I didn't target the attn layers: https://huggingface.co/Mihaiii/Pallas-0.5-LASER-exp2-0.1 and it seems ok after some quick testing.

Here I directly chose the layers to target based on previous experiments.

I tried to frankenmerge this new model using mergekit, but I never got good results, no matter which merge config I tested (I tried stacking the layers I applied LASER to - meaning 56, 54, 51 - and I also tried stacking the layers in between the layers I applied LASER to). I thought it would be interesting to frankenmerge a model that had a LASER intervention, but it seems like one of those situations where I overcomplicated things and ruined something that was working :).

@dkmisra (Collaborator) commented Jan 4, 2024

Okay. Let me take a look at your repo later today. I can also try to run these experiments on my end later this week. I will put a leaderboard up later today, where we can collate all results together.

We did have to do a fairly exhaustive search over hyperparameters to find the optimal validation accuracy. I will also share the results for each hyperparameter in our search space, in case that helps people find good hyperparameters. Finding good hyperparameters is very much an open question at the moment. We are also learning more as we try more experiments.

@dkmisra (Collaborator) commented Jan 5, 2024

@Mihaiii, I am looking at your code here. Did you implement the LASER reductions for the LLM you are using? The existing code only works for Llama/GPT-J and RoBERTa. You need to implement the mappings (see here). I would double-check whether the reductions are even happening. I didn't go through all the changes you may have made.
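Purely as a hypothetical illustration of what such a mapping involves (the repo's actual mapping format may differ): for a Llama-style model like Yi, one would need to tell the code which module each generic layer name refers to, e.g.:

```python
# Hypothetical mapping from the generic lnames used in the scripts to
# module paths in a Hugging Face Llama-style model (format is illustrative only).
LLAMA_STYLE_MAPPING = {
    "fc_in":  "model.layers.{layer}.mlp.up_proj",
    "fc_out": "model.layers.{layer}.mlp.down_proj",
    "q_proj": "model.layers.{layer}.self_attn.q_proj",
    "k_proj": "model.layers.{layer}.self_attn.k_proj",
}
```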

Also, based on our experience, doing any intervention on the attention matrices typically hurts performance. I would also start by doing a reduction on a single weight matrix (either the first or the second layer of the MLP), and in the later layers of the network.

@Mihaiii commented Jan 5, 2024

@dkmisra Thanks for your answer. In my testing, I also got performance issues at inference time - sampling was taking longer (note that I convert everything to llama.cpp's Q8 GGUF, and that is what I use for inference). I'm pretty convinced the intervention happens, since I'm getting considerably different results (meaning what is logged) on the dataset I'm using (bigbench, subset causal_judgement) with different inputs (by inputs I mean what I change directly in the code, not command-line args, since those args are overwritten in the code).

I'm using the default Llama architecture mappings since Yi uses the Llama architecture, so everything should be supported.

In any case, the code from my branch doesn't contain anything new or especially useful - I'll use your standard code when available.

Thank you! :)

dkmisra added the enhancement (New feature or request) label on Jan 9, 2024