Mistral Support #4

Open
fakerybakery opened this issue Dec 31, 2023 · 16 comments
Labels: enhancement (New feature or request)

@fakerybakery commented Dec 31, 2023

Hi,
Great work on this! Is Mistral supported? Right now I only see GPT-J and Llama 2.
Thank you!

@dkmisra (Collaborator) commented Dec 31, 2023

We currently don't support Mistral, but please check again in mid-January for the new release. I would assume it should be straightforward to replace the LLM calls with Mistral, as we use the standard Hugging Face interface. So, one would replace the Llama LLM calls here and here with the Mistral code:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

which are mentioned here: https://huggingface.co/docs/transformers/main/en/model_doc/mistral

That said, sometimes the code requires some small changes elsewhere. E.g., the gold answer for GPT-J often uses a space at the beginning (i.e., the correct token is " London" instead of "London"), whereas for Llama 2 it doesn't.
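As an illustration (a minimal sketch, not from the repo; the checkpoint names are just the public Hugging Face IDs), the two tokenizers handle the leading space differently:

```python
from transformers import AutoTokenizer

# GPT-J uses byte-level BPE: " London" and "London" map to different tokens,
# so the gold answer usually needs the leading space.
gptj_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
print(gptj_tok.tokenize(" London"))
print(gptj_tok.tokenize("London"))

# Llama 2 uses SentencePiece, which adds the word-start marker itself,
# so the gold answer is written without a leading space.
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(llama_tok.tokenize("London"))
```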

The refactoring that we are working on will take care of all these small changes, and we aim to release it as soon as possible. If your need is more urgent, let us know and we can provide more specific guidance. And we would welcome any pull request that you want to provide.

dkmisra assigned dkmisra and pratyushasharma and unassigned dkmisra on Dec 31, 2023
@fakerybakery (Author)

Thank you!

@Mihaiii commented Dec 31, 2023

> And we would welcome any pull request that you want to provide.

I could provide pull requests because there are small typos (e.g., lname: {best_lnum}), but I thought it best to first wait for the refactoring. :)

BTW, yesterday I applied LASER to a 34B Yi model. To my knowledge, this is the first model on Hugging Face that has been modified with your method. My intention is to further target other layers (basically, apply LASER to multiple layers).

https://huggingface.co/Mihaiii/Pallas-0.5-LASER-0.1

@dkmisra (Collaborator) commented Dec 31, 2023

@Mihaiii, this is exciting and we are looking forward to your PR. It may be a good idea to wait for the refactoring; we are working to release it very quickly. I would hope to get it done by Jan 10th-15th.

@dkmisra (Collaborator) commented Dec 31, 2023

We also plan to release a leaderboard-style table on the website. We can add the numbers from your study and cite your HF release once it is up, if you want. Thanks again.

@fakerybakery (Author)

> And we would welcome any pull request that you want to provide.
>
> I could provide pull requests because there are small typos (e.g., lname: {best_lnum}), but I thought it best to first wait for the refactoring. :)
>
> BTW, yesterday I applied LASER to a 34B Yi model. To my knowledge, this is the first model on Hugging Face that has been modified with your method. My intention is to further target other layers (basically, apply LASER to multiple layers).
>
> huggingface.co/Mihaiii/Pallas-0.5-LASER-0.1

Nice! Quick question - how much compute did you need to create the Yi version?

@Mihaiii commented Dec 31, 2023

> Nice! Quick question - how much compute did you need to create the Yi version?

It's proportional to how many interventions you are willing to try out before settling on the best result. I used an A100 for more than 6 hours of experiments. I'd estimate it takes a little less than half an hour per experiment; part of that is the Yi model loading extremely slowly, and I had to reload it for each experiment due to memory constraints.

@Mihaiii commented Dec 31, 2023

> We can add the numbers from your study and cite your HF release once it is up, if you want.

Of course I'd want that. Thank you! :)
But that model is intermediate work - I'll apply LASER to multiple layers of the same single model instance.

@dkmisra (Collaborator) commented Jan 3, 2024

Sounds good. @Mihaiii, one thing we tried across many datasets is to focus only on the last 20-30% of the layers, and only the MLP parameters, and try very aggressive reductions. We often set rate (in our code) to 9.9 and 9.95. In terms of ρ, this means keeping only a 0.01 and 0.005 fraction of the max rank (max{m, n}). I'd be curious to see what kind of gains these reductions produce.

E.g., see the for loop here: https://github.com/pratyushasharma/laser/blob/main/src/intervention_llama2_bbh.py#L326
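For intuition, here is a plain-torch sketch (not the repo's actual helper; the rate-to-fraction mapping is taken from the comment above) of what such a reduction does to a single weight matrix:

```python
import torch

def low_rank_approx(weight: torch.Tensor, rate: float) -> torch.Tensor:
    """Rank-k SVD approximation of `weight` (m x n), where
    k = ((10 - rate) / 10) * max(m, n); rate 9.9 keeps ~1% of the max rank."""
    keep_frac = (10.0 - rate) / 10.0
    k = max(1, int(keep_frac * max(weight.shape)))
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    return (U[:, :k] * S[:k]) @ Vh[:k, :]

# e.g., applied to the second MLP weight of a late GPT-J block (illustrative only):
# block.mlp.fc_out.weight.data = low_rank_approx(block.mlp.fc_out.weight.data, 9.9)
```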

@Mihaiii commented Jan 3, 2024

I applied 6 interventions on 6 different layers and the end result of that experiment is not very good, but I messed with some attn layers and I think that was a mistake (I should have always targeted only the mlp layers).

I'm waiting for your refactoring and the changes mentioned here, and then I'll retry.

@dkmisra (Collaborator) commented Jan 3, 2024

Sounds good. Yes, I would not target attention at all, and to begin with, I would start by targeting only a single layer. In our code, --lname fc_in means the first layer of the MLP and --lname fc_out means the second layer of the MLP. We do have an experiment with stacking in our paper, and we do it via greedy selection (a sketch of that loop is included below).

If you want, I can take a quick look at your Python code, but no pressure at all. I am trying to get the refactoring done soon.
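A hedged sketch of that greedy stacking loop (apply_reduction and evaluate are hypothetical helpers, not functions from this repo):

```python
def greedy_laser(base_model, candidates, apply_reduction, evaluate):
    """candidates: list of (layer, lname, rate) triples to consider.
    apply_reduction: hypothetical helper returning a modified copy of the model.
    evaluate: hypothetical helper returning validation accuracy."""
    model, applied = base_model, []
    best_acc = evaluate(model)
    while True:
        round_best = None
        for layer, lname, rate in candidates:
            trial = apply_reduction(model, layer, lname, rate)
            acc = evaluate(trial)
            if acc > best_acc:
                best_acc, round_best = acc, (trial, (layer, lname, rate))
        if round_best is None:       # no candidate improved validation accuracy
            return model, applied
        model, choice = round_best   # keep the best reduction and stack the next one on top of it
        applied.append(choice)
```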

@dkmisra (Collaborator) commented Jan 4, 2024

Related to #9

@Mihaiii commented Jan 4, 2024

> If you want, I can take a quick look at your Python code, but no pressure at all. I am trying to get the refactoring done soon.

Thank you, but I don't have anything other than the branch I mentioned in the readme of the models I pushed for that experiment.
Basically, I followed the same greedy approach, but I always chose between attn and mlp. On some layers I included rates of 4 and 6 when searching for the best results for that layer; on others I didn't. The code from my branch is pretty chaotic.

In any case, I made a new model where I didn't target the attn layers: https://huggingface.co/Mihaiii/Pallas-0.5-LASER-exp2-0.1 and it seems ok after some quick testing.

Here I directly chose the layers to target based on previous experiments.

I tried to frankenmerge this new model using mergekit, but I never got good results, no matter which merge config I tested (I tried stacking the layers I applied LASER to - meaning 56, 54, 51 - and I also tried stacking the layers in between the layers I applied LASER to). I thought it would be interesting to frankenmerge a model that had a LASER intervention, but it seems like one of those situations where I overcomplicated things and ruined something that was working :).

@dkmisra (Collaborator) commented Jan 4, 2024

Okay. Let me take a look at your repo later today. I can also try to run these experiments on my end later this week. I will put a leaderboard up later today, where we can collate all results together.

We did have to do a fairly exhaustive search over hyperparameters to find the optimal validation accuracy. I will also share the results for each hyperparameter in our search space, in case that helps people find good hyperparameters. Finding good hyperparameters is very much an open question at the moment. We are also learning more as we try more experiments.

@dkmisra (Collaborator) commented Jan 5, 2024

@Mihaiii, I am looking at your code here. Did you implement the LASER reductions for the LLM you are using? The existing code only works for Llama/GPT-J and RoBERTa. You need to implement the mappings (see here). I would double-check whether the reductions are even happening. I didn't go through all the changes you may have made.
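Purely as a hypothetical illustration of what such a mapping involves (the repo's actual mapping format may differ): for a Llama-style model like Yi, one would need to tell the code which module each generic layer name refers to, e.g.:

```python
# Hypothetical mapping from the generic lnames used in the scripts to
# module paths in a Hugging Face Llama-style model (format is illustrative only).
LLAMA_STYLE_MAPPING = {
    "fc_in":  "model.layers.{layer}.mlp.up_proj",
    "fc_out": "model.layers.{layer}.mlp.down_proj",
    "q_proj": "model.layers.{layer}.self_attn.q_proj",
    "k_proj": "model.layers.{layer}.self_attn.k_proj",
}
```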

Also, based on our experience, doing any intervention on the attention matrices typically hurts performance. I would also start by doing a reduction on a single weight matrix (either the first or the second layer of the MLP), and in the later layers of the network.

@Mihaiii commented Jan 5, 2024

@dkmisra Thanks for your answer. In my testing, I also got performance issues at inference time - sampling was taking longer (note that I convert everything to llama.cpp's Q8 GGUF, and that is what I use for inference). I'm pretty convinced the intervention happens, since I'm getting considerably different results (meaning what is logged) on the dataset I'm using (bigbench, subset causal_judgement) with different inputs (by inputs I mean what I change directly in the code, not command-line args, since those args are overwritten in the code).

I'm using the default Llama architecture mappings since Yi uses the Llama architecture, so everything should be supported.

In any case, the code from my branch doesn't contain anything new or especially useful - I'll use your standard code when available.

Thank you! :)

dkmisra added the enhancement (New feature or request) label on Jan 9, 2024