server : add speculative decoding support #10455
Conversation
(force-pushed from 1973399 to 7dc6ae5)
From what I have read, the goal is faster inference while retaining the quality of the larger model. I am using an RX 6900 XT with Vulkan; with an incorrect configuration I get about 10-12 t/s.
Flipping the models increased speed and the output looks similar. This makes sense, since -md is the draft model, which is supposed to be the smaller one. With the correct configuration I get about 16 t/s.
With a lower context of 2048, the server crashed when the limit was reached.
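For reference, a minimal sketch of the corrected ordering (model paths are placeholders; flags follow the ones used elsewhere in this thread): the larger target model goes to -m and the smaller draft model to -md.
# target (large) model on -m, draft (small) model on -md -- paths are placeholders
llama-server -m models/Qwen2.5-Coder-7B-Instruct-Q8_0.gguf -md models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf -ngl 99 -ngld 99 -fa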
(force-pushed from c5ddee2 to e80f758)
@3Simplex What is the output of the following bench on your machine:
llama-bench.exe -m "...Qwen2.5-Coder-7B-Instruct-Q8_0.gguf" -p 1,1,2,3,4,5,6,7,8,12,16,32 -r 20 -n 0 -ngl 99 -fa 1
.\llama-bench.exe -m "...\Qwen2.5-Coder-7B-Instruct-Q8_0.gguf" -p 1,1,2,3,4,5,6,7,8,12,16,32 -r 20 -n 0 -ngl 99 -fa 1
build: 0c74590 (4160)
I tried out commit e80f758 with my P40s, 3xP40s and 3090. These are the commands for the baselines and the tests. Baseline:
With speculative model (just removed the
Tested it with curl using:
Data:
(force-pushed from e80f758 to d905266)
Currently, it requires
The biggest benefit from speculative sampling is when you have more grounding. For example, if you have enough memory for a bigger context, you can try something like this:
# get the llama.vim plugin source code
code=$(curl -s https://raw.githubusercontent.com/ggml-org/llama.vim/refs/heads/master/autoload/llama.vim | jq -sRr @json)
# ask qwen to implement something (speculative decoding disabled)
curl --request POST --url http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d "$(jq -n --arg code "$code" \
'{ messages: [{ role: "system", content: "You are an expert computer scientist. Respond only with code blocks. Do not add any other comments except code." }, { role: "user", content: "Suggest an improvement for the `chunk_sim` function using Levenstein distance: ```\($code)```" }], cache_prompt: true, top_k: 1, samplers: ["top_k"], "speculative.n_max": 0 }')" | jq -r .choices[0].message.content
# speculative decoding enabled
curl --request POST --url http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d "$(jq -n --arg code "$code" \
'{ messages: [{ role: "system", content: "You are an expert computer scientist. Respond only with code blocks. Do not add any other comments except code." }, { role: "user", content: "Suggest an improvement for the `chunk_sim` function using Levenstein distance: ```\($code)```" }], cache_prompt: true, top_k: 1, samplers: ["top_k"], "speculative.n_max": 16 }')" | jq -r .choices[0].message.content
With CUDA, you might want to try setting
Thank you for the guidance. Using d905266, I reran the tests. Results look quite good.
Server command:
Kept this pretty consistent, except for the 3xP40 run where I added
Client side:
For the client-side curl, I changed
Here are the raw results. Some observations first:
3090 data
single P40
3xP40 (-sm row)
Code generated:
function! s:chunk_sim(c0, c1)
let l:lines0 = join(a:c0, "\n")
let l:lines1 = join(a:c1, "\n")
let l:distance = levenshtein(l:lines0, l:lines1)
return 1 - (l:distance / max([strlen(l:lines0), strlen(l:lines1)]))
endfunction
function! levenshtein(s1, s2)
let l:len1 = strlen(a:s1)
let l:len2 = strlen(a:s2)
if l:len1 == 0
return l:len2
endif
if l:len2 == 0
return l:len1
endif
let l:dp = []
for i in range(l:len1 + 1)
call add(l:dp, [])
for j in range(l:len2 + 1)
call add(l:dp[i], 0)
endfor
endfor
for i in range(l:len1 + 1)
let l:dp[i][0] = i
endfor
for j in range(l:len2 + 1)
let l:dp[0][j] = j
endfor
for i in range(1, l:len1 + 1)
for j in range(1, l:len2 + 1)
let l:cost = (strcharpart(a:s1, i - 1, 1) == strcharpart(a:s2, j - 1, 1)) ? 0 : 1
let l:dp[i][j] = min([l:dp[i - 1][j] + 1, l:dp[i][j - 1] + 1, l:dp[i - 1][j - 1] + l:cost])
endfor
endfor
return l:dp[l:len1][l:len2]
endfunction
Also, is
Thanks for the detailed tests. The results are inflated because there is one tricky side effect from the caching: consecutive runs with the same prompt will reuse the previous draft context, which, combined with greedy sampling, makes the drafting instantaneous. So, in the following data for example, only the first result is relevant:
i.e.
This was a bug; it is fixed now. You should be able to change
Btw, here is another fun test that I came up with, which uses less context and is suitable for speculation:
# get top 10 stories from Hacker News
hn=$(curl -s https://hacker-news.firebaseio.com/v0/topstories.json | jq -r '.[:10] | @tsv' | tr '\t' '\n' | xargs -I {} curl -s "https://hacker-news.firebaseio.com/v0/item/{}.json" | jq -sRr @json)
# make a Markdown table based on some criteria
curl --request POST --url http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d "$(jq -n --arg hn "$hn" \
'{ messages: [{ role: "system", content: "You are a helpful text-editing assistant. Respond only with the requested text. Do not add any other comments to your response." }, { role: "user", content: "Extract a Markdown table that contains only stories about software engineering, AI or machine learning from the front-page of HN. The table should include: author, title, score, comments and an URL to the story: ```\($hn)```." }], cache_prompt: true, top_k: 1, samplers: ["top_k"], "speculative.n_max": 16 }')" | jq -r .choices[0].message.content
Thanks. That seems a lot more realistic. I did some tests with a much shorter prompt: "write snake game in swift"
These numbers look reasonable. The speedup can vary in both directions depending on the inputs, but enabling speculative decoding should almost never result in slower-than-normal decoding.
With this build I am up to 25 t/s on first-run generation with speculative decoding, using 15/5 draft tokens.
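Presumably the "15/5" refers to the draft limits exposed by this PR; a hedged sketch of such a configuration (the flag mapping and model paths are assumptions):
# assuming 15/5 means --draft-max 15 and --draft-min 5 (paths are placeholders)
llama-server -m models/target.gguf -md models/draft.gguf -ngl 99 -ngld 99 -fa --draft-max 15 --draft-min 5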
A bit of data with llama-3.1 70B and llama-3.2 1B as the draft model. Prompt: "write a story about the natural resources in Canada".
Server:
client (changed speculative.n_max between
Note that I am not very sure what happens with multiple GPUs, but it is possible that the draft model gets split across them, which is not desired (see the logs if that is the case). You would want to keep the draft model fully on one GPU.
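A hedged sketch of pinning the draft model to a single device with the -devd option mentioned in the PR description (the device name and model paths are assumptions; use whatever devices llama-server reports on your system):
# keep the draft model entirely on one GPU (device name is only an example)
llama-server -m models/Llama-3.1-70B-Instruct-Q4_K_M.gguf -md models/Llama-3.2-1B-Instruct-Q4_K_M.gguf -ngl 99 -ngld 99 -fa -devd CUDA0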
(force-pushed from c277c4d to 156aa6d)
I wonder if it is possible to load the draft and main models onto different backends, i.e. a 7900 XTX and a P40 in a -cb process.
Sorry, I don't know what to call it. I am referring to the
I am still new to all of this, but I thought that the higher that percentage, the greater the speed improvement would be with speculative decoding. Is that a wrong assumption?
Thanks for this PR. +50% speed gain with 14B Q4_K_L & 0.5B Q4_0 on an RTX 2070 Q-Max + Quadro P5000. Now 28+ t/s on a small prompt with a 350-token response. My mistake earlier: I've got an unrelated issue with my 8 GB + 16 GB GPU setup. An 8K non-quantized context fits without overflowing to shared GPU memory, but an 8K context quantized to Q4_0 overflows by 0.5 GB per GPU. There is enough VRAM, but it overflows to slow RAM.
@ggerganov I am testing with this command:
llama-speculative -m $HOME/.cache/lm-studio/models/lmstudio-community/Qwen2.5-14B-Instruct-GGUF/Qwen2.5-14B-Instruct-Q4_K_M.gguf -p "what is kubernetes" -t 14 -ngl 1000 -ngld 1000 -fa -md $HOME/.cache/lm-studio/models/lmstudio-community/Qwen2.5-3B-Instruct-GGUF/Qwen2.5-3B-Instruct-Q4_K_M.gguf --top-k 1 --draft-max 16 --draft-min 5
I have tried different values and different sizes for the draft model. Speed is similar to running without the draft model. What else can I try?
This gave only a minimal speedup, still slower than without the draft model, but thanks.
@ggerganov do you have an opinion on lookup decoding in the server? My previous attempts didn't work consistently or well enough for me to feel the additional complexity in the server would be worthwhile; #8648 is the last update. But if there is already speculative decoding support, the additional complexity would probably not be too high. Though now that llama.cpp training will soon be available, it may make more sense to distill a model for speculative decoding.
Trying to test Mistral Large with Mistral 7B as the draft model, the server throws an error. I see this mentioned on Reddit, but people seem to have made it work using exlv2 and tabby?
@JohannesGaessler I think we can experiment with lookup decoding and other similar approaches. The design currently is to provide the draft with the call shown in llama.cpp/examples/speculative-simple/speculative-simple.cpp, lines 138 to 147 (at 9fd8c26).
This call simply has to provide the drafted tokens; it does not matter how exactly they are generated. So abstracting the
Btw, I am also interested in ways to make the draft generation asynchronous, similar to the approach in #6853. However, the reconciliation logic might be more difficult to implement and might require too many changes in the examples and the server to support. So probably further in the future.
@Gobz You can try to disable the checks in |
@ggerganov Works great, it seems to be just the [IMG] and [control_8] tokens that are mismatched, so for general use it's fine if you don't need those.
Do you care to speculate how this will impact inference on pure CPU, then? The memory bandwidth of consumer CPUs is nothing to be impressed by.
A simple change that could already be implemented would be to process the prompt on the draft model and the main model simultaneously. This would work on backends that support async compute, like CUDA, but for other backends using
AMD was claiming pretty high acceptance rates for their model with Llama 7B as the target. I wonder if those results can be repeated here. Also, the speedup was even higher on CPUs: the MI250x is around 2.5x and the CPU around 3.5x (maybe this is because it is a very simple model and the CPU has enough oomph to run it as a draft, while some of the other draft models run too slowly on CPU to help). The actual output of this model is pretty bad on its own... but apparently it is a good draft model? The interesting inference perf numbers are at the bottom of the page.
Sorry, I am going mad.
But using this command:
And then call it:
I get
Is the server using speculative decoding? I feel it's not. Why?
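One hedged way to check, reusing the request parameters from the earlier examples in this thread, is to send the same prompt once with drafting disabled and once enabled, and compare how long each run takes (port and prompt are placeholders):
# speculative decoding disabled for this request
curl -s http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -d '{ "messages": [{ "role": "user", "content": "write snake game in swift" }], "top_k": 1, "samplers": ["top_k"], "speculative.n_max": 0 }' | jq -r .choices[0].message.content
# speculative decoding enabled
curl -s http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -d '{ "messages": [{ "role": "user", "content": "write snake game in swift" }], "top_k": 1, "samplers": ["top_k"], "speculative.n_max": 16 }' | jq -r .choices[0].message.content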
Just dropping my results for speculative decoding on an NVIDIA GeForce RTX 4070 Ti SUPER:
without a draft model: 37 t/s
llama-cli `
--model '.\vendor\llama.cpp\models\Qwen2.5-Coder-32B-Instruct.IQ3_XXS.gguf' `
--ctx-size 16384 `
--threads 16 `
--n-gpu-layers '99' `
--cache-type-k 'q8_0' `
--cache-type-v 'q8_0' `
--flash-attn `
--prompt "Write tetris in JavaScript"
with a draft model: 31 t/s
llama-speculative `
--model '.\vendor\llama.cpp\models\Qwen2.5-Coder-32B-Instruct.IQ3_XXS.gguf' `
--ctx-size 16384 `
--threads 16 `
--n-gpu-layers 99 `
--cache-type-k 'q8_0' `
--cache-type-v 'q8_0' `
--flash-attn `
--model-draft '.\vendor\llama.cpp\models\Qwen2.5-Coder-0.5B-Instruct.IQ4_XS.gguf' `
--ctx-size-draft 16384 `
--n-gpu-layers-draft 99 `
--draft-min 5 `
--draft-max 16 `
--prompt "Write tetris in JavaScript"
Questions
If I run parallel requests, then I get the error:
llama_get_logits_ith: invalid logits id 48, reason: batch.logits[48] != true
It does not seem to be the parallel option itself: if I run with parallel 8 but serialize the requests from the client, then it is okay; if I set the client to parallelize 2 requests, the error comes after a while; and if I set the client to parallelize 8 requests, it appears almost immediately. It looks like continuous batching removes items before they are done; if I disable continuous batching, then speculative decoding gives no error.
@countzero
@ggerganov, that fix seems to work perfectly, thanks for the extremely quick fix.
Could someone help me out? I'm trying to figure out where I'm going wrong. I have an M4 Pro with 64 GB of memory, and when I use the 32B Qwen models (both the regular and coder versions) with llama.cpp, I usually get about 11 tokens per second. I'm trying to see if I can boost the speed by using speculative decoding, but I haven't had much luck so far. For instance, when I run the following command:
llama-speculative -m $HOME/.cache/lm-studio/models/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF/Qwen2.5-Coder-32B-Instruct-Q4_K_L.gguf -p "write a ruby script to count the files in a directory recursively" -ngl 1000 -ngld 1000 -fa -md $HOME/.cache/lm-studio/models/bartowski/Qwen2.5-Coder-3B-GGUF/Qwen2.5-Coder-3B-Q4_0.gguf --top-k 1 --draft-max 16 --draft-min 5
I get this output:
There's no noticeable speed improvement. I also tried running the server with:
llama-server -m $HOME/.cache/lm-studio/models/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF/Qwen2.5-Coder-32B-Instruct-Q4_K_L.gguf -ngl 99 -ngld 99 -fa -md $HOME/.cache/lm-studio/models/bartowski/Qwen2.5-Coder-3B-GGUF/Qwen2.5-Coder-3B-Q4_0.gguf --top-k 1 --draft-max 16 --draft-min 5 --port 8033
But I still see the same token speed and no improvement. What am I missing here?
I wonder if the default p-min of 0.9 is too high. I can get a further 20-30% speedup by setting a lower
GPU: RTX 4060 Ti 16GB
@PkmX The p-min = 0.9 default is very conservative. The idea is to enable speculation only for blocks of tokens where the LLM is very confident. With CUDA, it might be better to reduce p-min and also n-min. Feel free to experiment.
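For instance, a hedged sketch of lowering those thresholds via the draft options added around this PR (the --draft-p-min flag name is an assumption based on the other draft-* options; values and paths are illustrative only):
# more aggressive drafting: lower minimum draft length and acceptance probability (values are illustrative)
llama-server -m models/target.gguf -md models/draft.gguf -ngl 99 -ngld 99 -fa --draft-max 16 --draft-min 2 --draft-p-min 0.5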
Why am I getting a consistent 60 tokens/s with llama-speculative while only 40 tokens/s through llama-server? Using the following two commands:
llama-speculative:
llama-server:
And then querying the exact same prompt through Open WebUI, with temperature set to 0 and top-k to 1. Is there anything that can explain this rather big discrepancy?
llama-speculative:
Quick update: Dropping p-min increased the tokens/second for llama-server. I maxed out the speed at 53 tokens/second at a p-min of 0.4, and it remained at 53 tokens/second all the way down to 0. Two questions come to mind:
Update 2: Managed to obtain the following result:
This was obtained through the following command:
Over 60 tokens/second on a single 7900 XTX! What a time to be alive :) Thank you so much for all your hard work @ggerganov! Still very curious why I need different settings between llama-speculative and llama-server, but I am extremely happy I was able to fully unlock the potential of my 7900 XTX.
@Mushoz
target #10362
Initial implementation that enables speculative decoding in llama-server. Test with this command:
- --draft-max and --draft-min might need tuning
- llama.cpp Web UI client: set Top K = 1
- With multiple GPUs, use the -devd argument to put the draft model on only one of them (llama : accept a list of devices to use to offload a model #10497)
Feedback is appreciated.
TODO:
- Change server.params to something else to avoid confusion