> To fix the bottleneck at the memory width, I propose:
I will be making this one of my coding projects to get models running on the ANE. After trying MLX, I've finally accepted that my 2020 MacBook Pro with 8GB of RAM will never be able to run large models. On top of that, llama.cpp doesn't use the ANE either. I believe Apple understands that an open ANE API would be a massive advantage in the AI race, so they're holding it back and commercializing it themselves before opening the gates. I will be taking a crack at reproducing *LLM in a Flash: Efficient Large Language Model Inference with Limited Memory* on my own; I can see how it would further help with the memory issues. I wonder if it would be beneficial to code it in Swift, Apple's native language.
-
Apple released a new ANE-optimized transformers implementation (repo, blog).
h/t @antmikinka for pointing it out
It seems they had some obstacles to overcome that were specific to vision, but they mention three things that apply to ANE transformers in general:
It's interesting to see that they're using a CNN-Transformer hybrid here too. The iOS 17 speech-to-text model uses one as well. Wonder if we'll see more of that in the future.
Footnotes
Apple's Tiny-MOAT-1 (TM1) models have ~5M parameters (with 256x256 input) and ~10M parameters (with 512x512 input), vs. the smallest GPT-2 at 117M parameters (80M excluding embeddings). ↩