server : add speculative decoding support #10455

Merged: 2 commits merged into master from gg/speculative-server on Nov 25, 2024

Conversation

ggerganov
Owner

@ggerganov ggerganov commented Nov 22, 2024

target #10362

Initial implementation that enables speculative decoding in llama-server. Test with this command:

./bin/llama-server \
    -m  ../models/qwen2.5-32b-coder-instruct/ggml-model-q8_0.gguf \
    -md ../models/qwen2.5-0.5b-coder-instruct/ggml-model-q4_0.gguf \
    -ngl 99 -ngld 99 -fa --port 8033 -c 32768 \
    --draft-max 16 --draft-min 5

Feedback is appreciated.
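
For a quick smoke test of the endpoint, a request along these lines should work (an illustrative example only; any prompt will do, and the sampling fields match the ones used in the tests below):

curl http://localhost:8033/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [{"role": "user", "content": "write hello world in golang"}],
          "cache_prompt": true,
          "top_k": 1,
          "samplers": ["top_k"],
          "max_tokens": 256
        }' | jq -r .choices[0].message.content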

TODO:

  • simplify
  • control draft context size
  • rename server.params to something else to avoid confusions
  • test multi-user
  • test offloading draft model with RPC

@3Simplex

3Simplex commented Nov 22, 2024

From what I have read, the goal is faster inference while retaining the quality of the larger model.

I am using an RX 6900 XT with Vulkan.
Using Qwen2.5-Coder-7B-Instruct-Q8_0.gguf alone I see 50 t/s
Using Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf alone I see 230 t/s

I get about 10-12 t/s with an incorrect configuration.

  • .\llama-server.exe -m "...Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf" -md "...Qwen2.5-Coder-7B-Instruct-Q8_0.gguf" -ngl 99 -ngld 99 -fa --port 8080 -c 32768 --draft 10 --draft-min 5

Flipping the models increased the speed and the output looks similar. This makes sense, since -md is the draft model, which is supposed to be the smaller model.

I get about 16 t/s with the correct configuration.

  • .\llama-server.exe -m "...Qwen2.5-Coder-7B-Instruct-Q8_0.gguf" -md "...Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf" -ngl 99 -ngld 99 -fa --port 8080 -c 32768 --draft 10 --draft-min 5

With a lower context of 2048, the server crashed when the limit was reached.

@ggerganov ggerganov force-pushed the gg/speculative-server branch 5 times, most recently from c5ddee2 to e80f758 on November 24, 2024 15:09
@ggerganov ggerganov marked this pull request as ready for review November 24, 2024 15:11
@ggerganov
Owner Author

@3Simplex What is the output of the following bench on your machine:

llama-bench.exe -m "...Qwen2.5-Coder-7B-Instruct-Q8_0.gguf" -p 1,1,2,3,4,5,6,7,8,12,16,32 -r 20 -n 0 -ngl 99 -fa 1

@3Simplex

@ggerganov

.\llama-bench.exe -m "...\Qwen2.5-Coder-7B-Instruct-Q8_0.gguf" -p 1,1,2,3,4,5,6,7,8,12,16,32 -r 20 -n 0 -ngl 99 -fa 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6900 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64

ggml_vulkan: Compiling shaders..............................Done!

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | Vulkan | 99 | 1 | pp1 | 37.79 ± 0.30 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | Vulkan | 99 | 1 | pp1 | 37.81 ± 0.29 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | Vulkan | 99 | 1 | pp2 | 16.14 ± 0.04 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | Vulkan | 99 | 1 | pp3 | 23.40 ± 0.06 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | Vulkan | 99 | 1 | pp4 | 31.10 ± 0.04 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | Vulkan | 99 | 1 | pp5 | 37.39 ± 1.74 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | Vulkan | 99 | 1 | pp6 | 45.52 ± 0.06 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | Vulkan | 99 | 1 | pp7 | 51.53 ± 0.09 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | Vulkan | 99 | 1 | pp8 | 58.57 ± 0.28 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | Vulkan | 99 | 1 | pp12 | 80.38 ± 0.13 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | Vulkan | 99 | 1 | pp16 | 105.83 ± 0.54 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | Vulkan | 99 | 1 | pp32 | 202.53 ± 0.21 |

build: 0c74590 (4160)

@mostlygeek

I tried out commit e80f758 with my P40, 3xP40 and 3090 setups. These are the commands for the baselines and the tests.

Baseline:

./llama-server -m /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf -ngl 99 -ngld 99 -fa --port 9999 -c 4096 --draft-max 16 --draft-min 5

With the speculative model (the baseline is the same command with the -md model.gguf removed):

./llama-server -m /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf -md /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf -ngl 99 -ngld 99 -fa --port 9999 -c 4096 --draft-max 16 --draft-min 5

Tested it with curl using:

for n in `seq 1 5`; do curl http://10.0.1.50:9999/v1/chat/completions -N -v -d '{"messages":[{"role":"user","content":"write hello world in golang"}],"temperature":0.1, "stream":false,"max_tokens":1000, "model":"coder" }'; done

Data:

| GPU | baseline pp | baseline eval | -md ... pp | -md ... eval |
| --- | --- | --- | --- | --- |
| 3090 | 299 tps | 34 tps | 300 tps | 31 tps |
| P40 | 101 tps | 11.22 tps | 101 tps | 10.52 tps |
| 3xP40 | 91 tps | 10.6 tps | 90 tps | 9.8 tps |

@ggerganov
Owner Author

ggerganov commented Nov 24, 2024

Currently, it requires cache_prompt: true to be set to do speculation. This will be fixed in upcoming PRs. Using greedy sampling should improve things as well:

cache_prompt: true, top_k: 1, samplers: ["top_k"]

The biggest benefit from speculative sampling is when you have more grounding. For example, if you have enough memory for a bigger context, you can try something like this:

# get the llama.vim plugin source code
code=$(curl -s https://raw.githubusercontent.com/ggml-org/llama.vim/refs/heads/master/autoload/llama.vim | jq -sRr @json)

# ask qwen to implement something (speculative decoding disabled)
curl --request POST --url http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d "$(jq -n --arg code "$code" \
  '{ messages: [{ role: "system", content: "You are an expert computer scientist. Respond only with code blocks. Do not add any other comments except code." }, { role: "user", content: "Suggest an improvement for the `chunk_sim` function using Levenstein distance: ```\($code)```" }], cache_prompt: true, top_k: 1, samplers: ["top_k"], "speculative.n_max": 0 }')" | jq -r .choices[0].message.content

# speculative decoding enabled
curl --request POST --url http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d "$(jq -n --arg code "$code" \
  '{ messages: [{ role: "system", content: "You are an expert computer scientist. Respond only with code blocks. Do not add any other comments except code." }, { role: "user", content: "Suggest an improvement for the `chunk_sim` function using Levenstein distance: ```\($code)```" }], cache_prompt: true, top_k: 1, samplers: ["top_k"], "speculative.n_max": 16 }')" | jq -r .choices[0].message.content

With CUDA, you might want to try setting "speculative.n_min": 0 or 1 since I think it has efficient small-batch kernels for Q4_K, so no need to skip the small batches.

@mostlygeek

mostlygeek commented Nov 24, 2024

Thank you for the guidance. Using d905266, I reran the tests.

Results look quite good.

| GPU | n_max:0 | n_max:16 | change |
| --- | --- | --- | --- |
| P40 | 8.7 tps | 39.4 tps | 4.45x |
| 3xP40 -sm row | 12.70 tps | 53 tps | 4.17x |
| 3090 | 29 tps | 167 tps | 5.73x |

Server command:

./llama-server -m /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf -md /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf -ngl 99 -ngld 99 -fa --port 9999 -c 10240 --draft-max 16 --draft-min 0 --host 0.0.0.0 2>&1 | grep 'eval time'

Kept this pretty consistent, except for the 3xP40 run where I added -sm row

Client side:

$ code=$(curl -s https://raw.githubusercontent.com/ggml-org/llama.vim/refs/heads/master/autoload/llama.vim | jq -sRr @json)

$ for n in `seq 1 5`; \
do \
    curl --request POST --url http://10.0.1.50:9999/v1/chat/completions \
        -H "Content-Type: application/json" -H "Authorization: Bearer no-key" \
        -d "$(jq -n --arg code "$code" '{ messages: [{ role: "system", content: "You are an expert computer scientist. Respond only with code blocks. Do not add any other comments except code." }, { role: "user", content: "Suggest an improvement for the `chunk_sim` function using Levenstein distance: ```\($code)```" }], cache_prompt: true, top_k: 1, samplers: ["top_k"], "speculative.n_max": 16 }')" | jq -r .choices[0].message.content; \
done

For the client side curl, I changed speculative.n_max between 0 and 16 to get the different timings.

Here are the raw results. Some observations first:

  • with n_max: 0, 437 tokens were generated. With n_max: 16, 440 tokens were generated.
  • the server was restarted between tests to clear the cache
  • the code generated was identical (ran it through a diff)

3090 data

# speculative.n_max: 0
prompt eval time =    8032.34 ms /  8318 tokens (    0.97 ms per token,  1035.56 tokens per second)
       eval time =   14975.84 ms /   437 tokens (   34.27 ms per token,    29.18 tokens per second)
prompt eval time =      37.56 ms /     1 tokens (   37.56 ms per token,    26.62 tokens per second)
       eval time =   14988.71 ms /   437 tokens (   34.30 ms per token,    29.16 tokens per second)
prompt eval time =      37.15 ms /     1 tokens (   37.15 ms per token,    26.92 tokens per second)
       eval time =   15005.60 ms /   437 tokens (   34.34 ms per token,    29.12 tokens per second)
prompt eval time =      37.27 ms /     1 tokens (   37.27 ms per token,    26.83 tokens per second)
       eval time =   15017.94 ms /   437 tokens (   34.37 ms per token,    29.10 tokens per second)
prompt eval time =      37.49 ms /     1 tokens (   37.49 ms per token,    26.67 tokens per second)
       eval time =   15026.50 ms /   437 tokens (   34.39 ms per token,    29.08 tokens per second)

# speculative.n_max: 16
prompt eval time =    7915.24 ms /  8318 tokens (    0.95 ms per token,  1050.88 tokens per second)
       eval time =    9432.51 ms /   440 tokens (   21.44 ms per token,    46.65 tokens per second)
prompt eval time =      38.44 ms /     1 tokens (   38.44 ms per token,    26.02 tokens per second)
       eval time =    2626.82 ms /   440 tokens (    5.97 ms per token,   167.50 tokens per second)
prompt eval time =      37.93 ms /     1 tokens (   37.93 ms per token,    26.37 tokens per second)
       eval time =    2629.31 ms /   440 tokens (    5.98 ms per token,   167.34 tokens per second)
prompt eval time =      37.91 ms /     1 tokens (   37.91 ms per token,    26.38 tokens per second)
       eval time =    2628.70 ms /   440 tokens (    5.97 ms per token,   167.38 tokens per second)
prompt eval time =      38.20 ms /     1 tokens (   38.20 ms per token,    26.18 tokens per second)
       eval time =    2637.09 ms /   440 tokens (    5.99 ms per token,   166.85 tokens per second)

single P40

# speculative.n_max: 0
prompt eval time =   55669.14 ms /  8318 tokens (    6.69 ms per token,   149.42 tokens per second)
       eval time =   50050.73 ms /   437 tokens (  114.53 ms per token,     8.73 tokens per second)
prompt eval time =     114.98 ms /     1 tokens (  114.98 ms per token,     8.70 tokens per second)
       eval time =   50075.91 ms /   437 tokens (  114.59 ms per token,     8.73 tokens per second)
prompt eval time =     113.24 ms /     1 tokens (  113.24 ms per token,     8.83 tokens per second)
       eval time =   50097.56 ms /   437 tokens (  114.64 ms per token,     8.72 tokens per second)
       
# speculative.n_max: 16
prompt eval time =   55362.42 ms /  8318 tokens (    6.66 ms per token,   150.25 tokens per second)
       eval time =   29859.49 ms /   440 tokens (   67.86 ms per token,    14.74 tokens per second)
prompt eval time =     113.02 ms /     1 tokens (  113.02 ms per token,     8.85 tokens per second)
       eval time =   11146.53 ms /   440 tokens (   25.33 ms per token,    39.47 tokens per second)
prompt eval time =     113.75 ms /     1 tokens (  113.75 ms per token,     8.79 tokens per second)
       eval time =   11142.33 ms /   440 tokens (   25.32 ms per token,    39.49 tokens per second)
prompt eval time =     113.19 ms /     1 tokens (  113.19 ms per token,     8.83 tokens per second)
       eval time =   11175.47 ms /   440 tokens (   25.40 ms per token,    39.37 tokens per second)
prompt eval time =     112.65 ms /     1 tokens (  112.65 ms per token,     8.88 tokens per second)
       eval time =   11159.70 ms /   440 tokens (   25.36 ms per token,    39.43 tokens per second)

3xP40 (-sm row)

# speculative.n_max: 0
prompt eval time =   36909.28 ms /  8318 tokens (    4.44 ms per token,   225.36 tokens per second)
       eval time =   34412.92 ms /   437 tokens (   78.75 ms per token,    12.70 tokens per second)
prompt eval time =      79.49 ms /     1 tokens (   79.49 ms per token,    12.58 tokens per second)
       eval time =   34414.53 ms /   437 tokens (   78.75 ms per token,    12.70 tokens per second)
prompt eval time =      79.40 ms /     1 tokens (   79.40 ms per token,    12.60 tokens per second)
       eval time =   34413.66 ms /   437 tokens (   78.75 ms per token,    12.70 tokens per second)

# speculative.n_max: 16
prompt eval time =   36858.25 ms /  8318 tokens (    4.43 ms per token,   225.68 tokens per second)
       eval time =   27168.81 ms /   440 tokens (   61.75 ms per token,    16.20 tokens per second)
prompt eval time =      79.72 ms /     1 tokens (   79.72 ms per token,    12.54 tokens per second)
       eval time =    8290.25 ms /   440 tokens (   18.84 ms per token,    53.07 tokens per second)
prompt eval time =      79.73 ms /     1 tokens (   79.73 ms per token,    12.54 tokens per second)
       eval time =    8295.16 ms /   440 tokens (   18.85 ms per token,    53.04 tokens per second)
prompt eval time =      79.99 ms /     1 tokens (   79.99 ms per token,    12.50 tokens per second)
       eval time =    8295.91 ms /   440 tokens (   18.85 ms per token,    53.04 tokens per second)
prompt eval time =      79.88 ms /     1 tokens (   79.88 ms per token,    12.52 tokens per second)
       eval time =    8301.95 ms /   440 tokens (   18.87 ms per token,    53.00 tokens per second)

Code generated:

function! s:chunk_sim(c0, c1)
    let l:lines0 = join(a:c0, "\n")
    let l:lines1 = join(a:c1, "\n")

    let l:distance = levenshtein(l:lines0, l:lines1)

    return 1 - (l:distance / max([strlen(l:lines0), strlen(l:lines1)]))
endfunction

function! levenshtein(s1, s2)
    let l:len1 = strlen(a:s1)
    let l:len2 = strlen(a:s2)

    if l:len1 == 0
        return l:len2
    endif

    if l:len2 == 0
        return l:len1
    endif

    let l:dp = []
    for i in range(l:len1 + 1)
        call add(l:dp, [])
        for j in range(l:len2 + 1)
            call add(l:dp[i], 0)
        endfor
    endfor

    for i in range(l:len1 + 1)
        let l:dp[i][0] = i
    endfor

    for j in range(l:len2 + 1)
        let l:dp[0][j] = j
    endfor

    for i in range(1, l:len1 + 1)
        for j in range(1, l:len2 + 1)
            let l:cost = (strcharpart(a:s1, i - 1, 1) == strcharpart(a:s2, j - 1, 1)) ? 0 : 1
            let l:dp[i][j] = min([l:dp[i - 1][j] + 1, l:dp[i][j - 1] + 1, l:dp[i - 1][j - 1] + l:cost])
        endfor
    endfor

    return l:dp[l:len1][l:len2]
endfunction

@mostlygeek

mostlygeek commented Nov 24, 2024

Also, are 0 and 16 the only valid values for speculative.n_max? I tried it with 4 and 12, and got this error: common/common.cpp:1480: GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded") failed

@ggerganov
Owner Author

ggerganov commented Nov 24, 2024

Thanks for the detailed tests. The results are inflated because there is one tricky side effect of the caching: consecutive runs with the same prompt reuse the previous draft context, which, combined with greedy sampling, makes the drafting instantaneous. So, in the following data for example, only the first result is relevant:

# speculative.n_max: 16
prompt eval time =    7915.24 ms /  8318 tokens (    0.95 ms per token,  1050.88 tokens per second)
       eval time =    9432.51 ms /   440 tokens (   21.44 ms per token,    46.65 tokens per second)    <--- only this is relevant
prompt eval time =      38.44 ms /     1 tokens (   38.44 ms per token,    26.02 tokens per second)
       eval time =    2626.82 ms /   440 tokens (    5.97 ms per token,   167.50 tokens per second)
prompt eval time =      37.93 ms /     1 tokens (   37.93 ms per token,    26.37 tokens per second)
       eval time =    2629.31 ms /   440 tokens (    5.98 ms per token,   167.34 tokens per second)
prompt eval time =      37.91 ms /     1 tokens (   37.91 ms per token,    26.38 tokens per second)
       eval time =    2628.70 ms /   440 tokens (    5.97 ms per token,   167.38 tokens per second)
prompt eval time =      38.20 ms /     1 tokens (   38.20 ms per token,    26.18 tokens per second)
       eval time =    2637.09 ms /   440 tokens (    5.99 ms per token,   166.85 tokens per second)

i.e. 46.65 t/s. The next runs are reusing the drafts and are not representative.

Also, are 0 and 16 the only valid values for speculative.n_max? I tried it with 4 and 12, and got this error: common/common.cpp:1480: GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded") failed

This was a bug - it is fixed now. You should be able to change n_max to any value. Btw, for CUDA it might make sense to set n_min to 0 or 1 and keep n_max ~ 16. But feel free to experiment.
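
For example, a request along these lines overrides both values per request (an illustrative body only; adjust the prompt and the numbers to taste):

curl http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d '{
    "messages": [{"role": "user", "content": "write hello world in golang"}],
    "cache_prompt": true, "top_k": 1, "samplers": ["top_k"],
    "speculative.n_min": 0,
    "speculative.n_max": 16
  }' | jq -r .choices[0].message.content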

Btw, here is another fun test that I came up with which uses less context and is suitable for speculation:

# get top 10 stories from Hacker News
hn=$(curl -s https://hacker-news.firebaseio.com/v0/topstories.json | jq -r '.[:10] | @tsv' | tr '\t' '\n' | xargs -I {} curl -s "https://hacker-news.firebaseio.com/v0/item/{}.json" | jq -sRr @json)

# make a Markdown table based on some criteria
curl --request POST --url http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d "$(jq -n --arg hn "$hn" \
  '{ messages: [{ role: "system", content: "You are a helpful text-editing assistant. Respond only with the requested text. Do not add any other comments to your response." }, { role: "user", content: "Extract a Markdown table that contains only stories about software engineering, AI or machine learning from the front-page of HN. The table should include: author, title, score, comments and an URL to the story: ```\($hn)```." }], cache_prompt: true, top_k: 1, samplers: ["top_k"], "speculative.n_max": 16 }')" | jq -r .choices[0].message.content

@mostlygeek

i.e. 46.65 t/s. The next runs are reusing the drafts and are not representative.

Thanks. That seems a lot more realistic.

I did some tests with a much shorter prompt: "write snake game in swift"

| GPU | n_max:0 | n_max:16 | change |
| --- | --- | --- | --- |
| P40 | 10.54 tps | 17.11 tps | 1.62x |
| 3xP40 -sm row | 16.22 tps | 22.80 tps | 1.4x |
| 3090 | 34.78 tps | 51.31 tps | 1.47x |

curl --request POST --url http://10.0.1.50:9999/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d "$(jq -n --arg code "$code" '{ messages: [{ role: "system", content: "You are an expert computer scientist. Respond only with code blocks. Do not add any other comments except code." }, { role: "user", content: "write snake game in swift"}], cache_prompt: true, top_k: 1, samplers: ["top_k"], "speculative.n_max": 0 }')" | jq -r .choices[0].message.content;

@ggerganov
Owner Author

These numbers look reasonable. The speedup can vary in either direction depending on the input, but enabling speculative decoding should almost never result in slower-than-normal decoding.

@3Simplex

These numbers look reasonable. The speedup can vary in either direction depending on the input, but enabling speculative decoding should almost never result in slower-than-normal decoding.

With this build I am up to 25 t/s on first-run generation with speculative decoding, using 15/5 draft tokens.

@mostlygeek

mostlygeek commented Nov 25, 2024

A bit of data with llama-3.1 70B and llama-3.2 1B as the draft model. Prompt: "write a story about the natural resources in Canada".

| GPU | n_max:0 | n_max:16 | change |
| --- | --- | --- | --- |
| 3xP40 -sm row | 9.80 tps | 12.27 tps | 1.25x |

Server:

$ ./llama-server -m /mnt/nvme/models/Meta-Llama-3.1-70B-Instruct-Q4_K_L.gguf \
-md /mnt/nvme/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
-ngl 99 -ngld 99 -fa --port 9999 -c 10240 --draft-max 16 --draft-min 1 \
--host 0.0.0.0 -sm row

client (changed speculative.n_max between 0 and 16)

$ curl --request POST --url http://10.0.1.50:9999/v1/chat/completions \
-d "$(jq -n --arg code "$code" '{ messages: [{ role: "system", content: "You are a helpful AI."}, {role: "user",content: "write a story about the natural resources in Canada"}], cache_prompt: true, top_k: 1, samplers: ["top_k"], "speculative.n_max": 0 }')" \
| jq -r .choices[0].message.content;

@ggerganov
Owner Author

Note that I am not very sure what happens with multiple GPUs, but it is possible that the draft model gets split across them, which is not desired (see the logs if that is the case). You would want to keep the draft model fully on one GPU.
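
For example, something along these lines (a sketch only; the paths are placeholders and the -devd flag, which should pin the draft model to a specific device, may need adjusting to your system):

# keep the draft model entirely on one device (illustrative)
./llama-server -m target-model.gguf -md draft-model.gguf \
    -ngl 99 -ngld 99 -fa -devd CUDA0 \
    --draft-max 16 --draft-min 5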

Base automatically changed from gg/speculative-refactor to master November 25, 2024 07:58
@sorasoras

Note that I am not very sure what happens with multiple GPUs, but it is possible that the draft model gets split across them, which is not desired (see the logs if that is the case). You would want to keep the draft model fully on one GPU.

I wonder if it is possible to load the draft and main models onto different backends, i.e. a 7900 XTX and a P40 in a single -cb process.

@ggerganov ggerganov merged commit 9ca2e67 into master Nov 25, 2024
62 checks passed
@vitobotta

@vitobotta You mention "acceptance rate" which makes me think you are doing something else that is not the topic of this PR. Make sure to read OP carefully and do the steps specified there. I am running this feature on Apple Silicon and it is designed specifically with limited-compute devices in mind.

Sorry, I don't know what to call it. I am referring to the accept value in output like this:

n_draft   = 16
n_predict = 672
n_drafted = 2608
n_accept  = 508
accept    = 19.479%

I am still new to all of this, but I thought that the higher that percentage, the greater the speed improvement from speculative decoding. Is that a wrong assumption?

@JeroenAdam

JeroenAdam commented Nov 25, 2024

@JeroenAdam I don't see how the KV cache quantization would affect the performance.

Thanks for this PR. A +50% speed gain with 14B Q4_K_L & 0.5B Q4_0 on an RTX 2070 Q-max + Quadro P5000. Now 28+ t/s on a small prompt and a 350-token response.

My mistake earlier: I have an unrelated issue with my 8 GB + 16 GB GPU setup. An 8K non-quantized context fits without overflowing into shared GPU memory, but an 8K Q4_0-quantized context overflows by 0.5 GB per GPU. There is enough VRAM, yet it overflows into slow RAM.

@vitobotta

@ggerganov I am testing with this command:

llama-speculative -m $HOME/.cache/lm-studio/models/lmstudio-community/Qwen2.5-14B-Instruct-GGUF/Qwen2.5-14B-Instruct-Q4_K_M.gguf -p "what is kubernetes" -t 14 -ngl 1000 -ngld 1000 -fa -md  $HOME/.cache/lm-studio/models/lmstudio-community/Qwen2.5-3B-Instruct-GGUF/Qwen2.5-3B-Instruct-Q4_K_M.gguf --top-k 1 --draft-max 16 --draft-min 5

I have tried different values and different sizes for the draft model. The speed is similar to what I get without the draft model. What else can I try?

@HabermannR

Using 2x Titan V and this build: https://github.com/ggerganov/llama.cpp/actions/runs/12012234135/artifacts/2234213091, the speculative model is still on both cards, and the speed is less than without the speculative model :(

I had the same and removed the KV cache quantization params, now enjoying the speed bump.

This gave only a minimal speedup, still slower than without the draft model, but thanks.

@JohannesGaessler
Collaborator

@ggerganov do you have an opinion on lookup decoding in the server? My previous attempts didn't work consistently or well enough that I felt the additional complexity in the server would be worthwhile; #8648 is the last update. But since there is already speculative decoding support, the additional complexity would probably not be too high. Though now that llama.cpp training will soon be available, it may make more sense to distill a model for speculative decoding.

@Gobz

Gobz commented Nov 25, 2024

Trying to test Mistral Large with Mistral 7B as the draft model, the server throws an error:
common_speculative_are_compatible: draft model vocab must match target model to use speculation but token 10 content differs - target '[IMG]', draft '[control_8]'

I see this mentioned on reddit, but people seem to have made it work using exlv2 and tabby?
https://www.reddit.com/r/LocalLLaMA/comments/1edm7xu/speculative_decoding_with_mistral_large_2_are/
Is the server too pedantic?

@ggerganov
Owner Author

ggerganov commented Nov 25, 2024

@JohannesGaessler I think we can experiment with lookup decoding and other similar approaches. The current design is to provide the draft via the common_speculative_gen_draft() call:

while (true) {
    // optionally, generate draft tokens that can be appended to the target batch
    //
    // this is the most important part of the speculation. the more probable tokens that are provided here
    // the better the performance will be. in theory, this computation can be performed asynchronously and even
    // offloaded to a remote device. it doesn't even have to be based on an LLM. instead, it can provide tokens
    // from a cache or lookup tables.
    //
    llama_tokens draft = common_speculative_gen_draft(spec, params_spec, prompt_tgt, id_last);

This call simply has to provide the drafted tokens and it does not matter how exactly they are generated. So abstracting the common_speculative class a bit to add different implementations should be easy to merge. The rest of the speculative decoding logic should not be affected.
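
For example, the dependency could look something like this (a toy sketch with hypothetical names and types, not the actual llama.cpp API):

#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

using toy_token  = int;
using toy_tokens = std::vector<toy_token>;

// the speculation loop only depends on this signature:
// (target prompt so far, last sampled token) -> proposed draft tokens
using draft_fn = std::function<toy_tokens(const toy_tokens &, toy_token)>;

// one possible non-LLM provider: a lookup table keyed on the last sampled token;
// a draft-model-backed provider would implement the same signature by running the small model
draft_fn make_lookup_drafter(std::unordered_map<toy_token, toy_tokens> table) {
    return [table = std::move(table)](const toy_tokens & /*prompt_tgt*/, toy_token id_last) -> toy_tokens {
        const auto it = table.find(id_last);
        return it != table.end() ? it->second : toy_tokens{};
    };
}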

Btw, I am also interested in ways to make the draft generation asynchronous, similar to the approach in #6853. However, the reconciliation logic might be more difficult to implement and might require too many changes in the examples and the server to support. So probably further in the future.

@ggerganov
Owner Author

@Gobz You can try to disable the checks in common_speculative_are_compatible() and see if it works. I don't have the models handy to give them a try atm.

@Gobz

Gobz commented Nov 25, 2024

@ggerganov Works great, it seems to be just the [IMG] and [control_8] tokens that are mismatched, so for general use it's fine if you don't need those.

@dagbdagb

I've tested various sizes of Qwen 2.5 models, like the 0.5b, 3b, and 7b, as draft models, with the 32b model as the primary one. I also experimented with different values for --draft-min/max mentioned in this thread and set --top-k to 1. However, the acceptance rate only goes up to about 20-25%, and I haven't noticed any speed improvements yet. What other options do I have? I'm using an M4 Pro mini with 64 GB of memory.

SD gives a speedup when there is an abundance of compute compared to the available bandwidth. This will be the case for dedicated GPUs, but less so for SoCs like those used in Macs.

Do you care to speculate how this will impact inference on pure CPU then? The memory bandwidth on consumer CPUs is nothing to be impressed by.

@slaren
Collaborator

slaren commented Nov 25, 2024

Btw, I am also interested in ways to make the draft generation asynchronous similar to the approach in #6853.

A simple change that could already be implemented would be to process the prompt on the draft model and the main model simultaneously. This would work on backends that support async compute like CUDA, but for other backends using std::async to launch the call to llama_decode should also be a straightforward change. I think that to do this common_speculative_gen_draft needs to be split into a function for processing the prompt and another for drafting.
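
A minimal sketch of that idea (assuming the target/draft contexts and batches already hold the same prompt tokens, llama.h is included, and error handling is omitted):

#include <future>

static void process_prompts_in_parallel(llama_context * ctx_tgt, llama_batch batch_tgt,
                                        llama_context * ctx_dft, llama_batch batch_dft) {
    // launch the draft-model prompt processing in the background ...
    std::future<int32_t> res_dft = std::async(std::launch::async, [&]() {
        return llama_decode(ctx_dft, batch_dft);
    });

    // ... while the target model processes the same prompt on this thread
    const int32_t res_tgt = llama_decode(ctx_tgt, batch_tgt);

    // both prompts must be done before drafting/verification starts
    GGML_ASSERT(res_tgt == 0 && res_dft.get() == 0);
}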

@cb88

cb88 commented Nov 25, 2024

https://www.amd.com/en/developer/resources/technical-articles/introducing-amd-first-slm-135m-model-fuels-ai-advancements.html

AMD was claiming pretty high acceptance rates for their model with Llama 7B as the target. I wonder if those results can be repeated here. Also, the speedup was even higher on CPUs: MI250X was about 2.5x and CPU about 3.5x (maybe because it is a very simple model and the CPU has enough oomph to run it as a draft, while some other draft models run too slowly on CPU to help).

The actual output of this model is pretty bad on its own... but apparently it's a good draft model? The interesting inference perf numbers are at the bottom of the page.

@HabermannR

Sorry, I am going mad.
Using this command
call llama-speculative.exe -m Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf --model-draft Qwen2.5-Coder-0.5B-Instruct-Q6_K.gguf -t 14 --temp 0 -ngl 65 -ngld 99 -devd CUDA1 -fa -c 1000 --draft-max 16 --draft-min 5 --prompt "Once upon a time" -n 50
Result:

common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)


Once upon a time, in a land far, far away, there was a magical kingdom called Eldoria. Eldoria was known for its lush forests, sparkling rivers, and towering mountains. The kingdom was ruled by a wise and benevolent queen named Elara, who was

encoded    4 tokens in    0.109 seconds, speed:   36.643 t/s
decoded   51 tokens in    4.096 seconds, speed:   12.450 t/s

n_draft   = 16
n_predict = 51
n_drafted = 288
n_accept  = 32
accept    = 11.111%

draft:

llama_perf_context_print:        load time =     924.05 ms
llama_perf_context_print: prompt eval time =    2166.17 ms /    39 tokens (   55.54 ms per token,    18.00 tokens per second)
llama_perf_context_print:        eval time =    1660.31 ms /   270 runs   (    6.15 ms per token,   162.62 tokens per second)
llama_perf_context_print:       total time =    4211.06 ms /   309 tokens

target:

llama_perf_sampler_print:    sampling time =       8.79 ms /    51 runs   (    0.17 ms per token,  5802.71 tokens per second)
llama_perf_context_print:        load time =   12781.68 ms
llama_perf_context_print: prompt eval time =    2171.88 ms /   310 tokens (    7.01 ms per token,   142.73 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   -8001.85 ms /   311 tokens

But using this command:
call llama-server.exe -m Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf --model-draft Qwen2.5-Coder-0.5B-Instruct-Q6_K.gguf -t 14 --temp 0 -ngl 65 -ngld 99 -devd CUDA1 --port 1234 -fa -c 1000 --draft-max 16 --draft-min 5

And then call it:

$body = @{
    messages = @(
        @{
            role = "system"
            content = "You are a helpful AI."
        },
        @{
            role = "user"
            content = "write a story about the natural resources in Canada"
        }
    )
    cache_prompt = $true
    top_k = 1
    samplers = @("top_k")
    "speculative.n_max" = 0
} | ConvertTo-Json

$response = Invoke-RestMethod -Method Post -Uri "http://127.0.0.1:1234/v1/chat/completions" -Body $body -ContentType "application/json"

$response.choices[0].message.content

I get

request: POST /v1/chat/completions 127.0.0.1 200
slot launch_slot_: id  0 | task 894 | processing task
slot update_slots: id  0 | task 894 | new prompt, n_ctx_slot = 1024, n_keep = 0, n_prompt_tokens = 28
slot update_slots: id  0 | task 894 | need to evaluate at least 1 token to generate logits, n_past = 28, n_prompt_tokens = 28
slot update_slots: id  0 | task 894 | kv cache rm [27, end)
slot update_slots: id  0 | task 894 | prompt processing progress, n_past = 28, n_tokens = 1, progress = 0.035714
slot update_slots: id  0 | task 894 | prompt done, n_past = 28, n_tokens = 1
slot      release: id  0 | task 894 | stop processing: n_past = 918, truncated = 0
slot print_timing: id  0 | task 894 |
prompt eval time =     303.02 ms /     1 tokens (  303.02 ms per token,     3.30 tokens per second)
       eval time =   41510.37 ms /   891 tokens (   46.59 ms per token,    21.46 tokens per second)
      total time =   41813.39 ms /   892 tokens
srv  update_slots: all slots are idle

Is the server using speculative decoding? I feel it's not. Why?

@countzero

Just dropping my results for speculative decoding on an NVIDIA GeForce RTX 4070 Ti SUPER:

without a draft model: 37 t/s

llama-cli `
    --model '.\vendor\llama.cpp\models\Qwen2.5-Coder-32B-Instruct.IQ3_XXS.gguf' `
    --ctx-size 16384 `
    --threads 16 `
    --n-gpu-layers '99' `
    --cache-type-k 'q8_0' `
    --cache-type-v 'q8_0' `
    --flash-attn `
    --prompt "Write tetris in JavaScript"
llama_perf_sampler_print:    sampling time =     168.65 ms /  1778 runs   (    0.09 ms per token, 10542.48 tokens per second)
llama_perf_context_print:        load time =    4605.98 ms
llama_perf_context_print: prompt eval time =      44.66 ms /     5 tokens (    8.93 ms per token,   111.96 tokens per second)
llama_perf_context_print:        eval time =   47885.40 ms /  1772 runs   (   27.02 ms per token,    37.01 tokens per second)
llama_perf_context_print:       total time =   48381.41 ms /  1777 tokens

with a draft model: 31 t/s

llama-speculative `
    --model '.\vendor\llama.cpp\models\Qwen2.5-Coder-32B-Instruct.IQ3_XXS.gguf' `
    --ctx-size 16384 `
    --threads 16 `
    --n-gpu-layers 99 `
    --cache-type-k 'q8_0' `
    --cache-type-v 'q8_0' `
    --flash-attn `
    --model-draft '.\vendor\llama.cpp\models\Qwen2.5-Coder-0.5B-Instruct.IQ4_XS.gguf' `
    --ctx-size-draft 16384 `
    --n-gpu-layers-draft 99 `
    --draft-min 5 `
    --draft-max 16 `
    --prompt "Write tetris in JavaScript"
encoded    5 tokens in    0.545 seconds, speed:    9.169 t/s
decoded 2008 tokens in   63.483 seconds, speed:   31.630 t/s

n_draft   = 16
n_predict = 2008
n_drafted = 3840
n_accept  = 1767
accept    = 46.016%

draft:

llama_perf_context_print:        load time =     487.35 ms
llama_perf_context_print: prompt eval time =   20506.28 ms /   484 tokens (   42.37 ms per token,    23.60 tokens per second)
llama_perf_context_print:        eval time =   41846.73 ms /  3600 runs   (   11.62 ms per token,    86.03 tokens per second)
llama_perf_context_print:       total time =   64032.93 ms /  4084 tokens

target:

llama_perf_sampler_print:    sampling time =     245.01 ms /  2008 runs   (    0.12 ms per token,  8195.58 tokens per second)
llama_perf_context_print:        load time =    4826.09 ms
llama_perf_context_print: prompt eval time =   14649.65 ms /  4085 tokens (    3.59 ms per token,   278.85 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   59560.54 ms /  4086 tokens

Questions

  1. Is the GPU's VRAM not the bottleneck on this card, or am I missing something?
  2. @ggerganov It would be nice to add the draft model parameters to llama-bench so that we can easily share results on the new feature.

@Gomez12

Gomez12 commented Nov 26, 2024

If I run parallel requests, then I get the error : llama_get_logits_ith: invalid logits id 48, reason: batch.logits[48] != true
and the server stops.

It seems not to be the parallel option itself: if I run with parallel 8 but serialise the requests from the client, it goes okay; if I set the client to parallelise 2 requests, the error comes after a while; but if I set the client to parallelise 8 requests, it is almost immediate.

It looks like continuous batching removes items before they are done; if I disable continuous batching, speculative decoding gives no error.

@JohannesGaessler
Collaborator

@countzero llama-bench does not read in or generate any actual text; it only measures how fast the model evaluation is using toy data. So determining speculative decoding performance would require a major rewrite of the module.

@ggerganov
Owner Author

@Gomez12 Please check if #10513 fixes the issue.

@ngxson ngxson mentioned this pull request Nov 26, 2024
5 tasks
@Gomez12

Gomez12 commented Nov 26, 2024

@ggerganov, that fix seems to work perfectly, thnx for the extremely quick fix.

@vitobotta

Could someone help me out? I'm trying to figure out where I'm going wrong.

I have an M4 Pro with 64 GB of memory, and when I use the 32B Qwen models (both the regular and Coder versions) with llama.cpp, I usually get about 11 tokens per second. I'm trying to see if I can boost the speed by using speculative decoding, but I haven't had much luck so far.

For instance, when I run the following command:

llama-speculative -m $HOME/.cache/lm-studio/models/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF/Qwen2.5-Coder-32B-Instruct-Q4_K_L.gguf -p "write a ruby script to count the files in a directory recursively" -ngl 1000 -ngld 1000 -fa -md  $HOME/.cache/lm-studio/models/bartowski/Qwen2.5-Coder-3B-GGUF/Qwen2.5-Coder-3B-Q4_0.gguf --top-k 1 --draft-max 16 --draft-min 5

I get this output:

encoded   12 tokens in    0.514 seconds, speed:   23.325 t/s
decoded  371 tokens in   34.412 seconds, speed:   10.781 t/s

n_draft   = 16
n_predict = 371
n_drafted = 912
n_accept  = 313
accept    = 34.320%

draft:

llama_perf_context_print:        load time =     273.47 ms
llama_perf_context_print: prompt eval time =   25054.05 ms /   125 tokens (  200.43 ms per token,     4.99 tokens per second)
llama_perf_context_print:        eval time =    8806.60 ms /   855 runs   (   10.30 ms per token,    97.09 tokens per second)
llama_perf_context_print:       total time =   34928.33 ms /   980 tokens

target:

llama_perf_sampler_print:    sampling time =      16.07 ms /   371 runs   (    0.04 ms per token, 23089.37 tokens per second)
llama_perf_context_print:        load time =    1120.41 ms
llama_perf_context_print: prompt eval time =   24730.76 ms /   981 tokens (   25.21 ms per token,    39.67 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   35201.82 ms /   982 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating

There's no noticeable speed improvement. I also tried running the server with:

llama-server -m $HOME/.cache/lm-studio/models/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF/Qwen2.5-Coder-32B-Instruct-Q4_K_L.gguf -ngl 99 -ngld 99 -fa -md  $HOME/.cache/lm-studio/models/bartowski/Qwen2.5-Coder-3B-GGUF/Qwen2.5-Coder-3B-Q4_0.gguf --top-k 1 --draft-max 16 --draft-min 5 --port 8033

But I still see the same token speed and no improvement. What am I missing here?

@PkmX

PkmX commented Nov 26, 2024

I wonder if the default p-min of 0.9 is too high. I can get a further 20-30% speedup by setting a lower --draft-p-min in llama-server.

GPU: RTX 4060Ti 16GB
Model: Qwen2.5-Coder-32B-Instruct-IQ3_M.gguf
Draft: Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
Sampling parameters: --temp 0 --top-k 1

| p-min | Write quicksort in C++ | How do transformers in LLM work? |
| --- | --- | --- |
| No SD | ~17.3 | ~17.2 |
| 0.9 | ~25.0 | ~18.3 |
| 0.8 | ~27.7 | |
| 0.7 | ~29.6 | ~20.4 |
| 0.6 | ~30.7 | ~21.9 |
| 0.5 | ~32.0 | ~19.7 |
| 0.4 | ~30.7 | |
| 0.3 | ~30.0 | |

@ggerganov
Owner Author

@PkmX The p-min = 0.9 is very conservative. The idea is to enable the speculation only for blocks of tokens where the LLM is very confident. With CUDA, it might be better to reduce p-min and also n-min. Feel free to experiment.
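
For example, something like this could be a starting point on CUDA (illustrative values only; the model paths are placeholders):

./llama-server -m target-model.gguf -md draft-model.gguf \
    -ngl 99 -ngld 99 -fa -c 32768 \
    --draft-max 16 --draft-min 0 --draft-p-min 0.5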

@Mushoz

Mushoz commented Nov 26, 2024

Why am I getting a consistent 60 tokens/sec with llama-speculative while only 40 tokens/s through llama-server? Using the following two commands:

llama-speculative:

./llama-speculative -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf -ngl 99 -md /models/Qwen2.5.1-Coder-1.5B-Instruct-IQ4_XS.gguf -ngld 99 --draft 10 -p "Please write a minesweeper game using html, js and css. Do not output any explanations. Only give me the 3 different files each in its own codeblock." --top-k 1 --temp 0.0

llama-server:

./llama-server --host 0.0.0.0 --port 8999 -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf -ngl 99 -md /models/Qwen2.5.1-Coder-1.5B-Instruct-IQ4_XS.gguf -ngld 99 --draft 10 --top-k 1 --temp 0.0

And then querying the exact same prompt through Open WebUI, with temperature set to 0 and top-k to 1.

Is there anything that can explain this rather big discrepancy?

llama-speculative: decoded 1332 tokens in 22.143 seconds, speed: 60.155 t/s
llama-server: eval time = 31188.10 ms / 1281 tokens ( 24.35 ms per token, 41.07 tokens per second)

@Mushoz

Mushoz commented Nov 26, 2024

Quick update: Dropping p-min increased the tokens/second for llama-server. I maxed out the speed at 53 tokens/second at a p-min of 0.4, and it remained at 53 tokens/second all the way down to 0. Two questions come to mind:

  1. Why do I see a performance increase when lowering p-min with llama-server, but not with llama-speculative? Both have the same default AFAIK.
  2. While performance is better, it's still not quite at llama-speculative levels. How do I obtain the same performance there?

Update 2:

Managed to obtain the following result: eval time = 21349.26 ms / 1292 tokens ( 16.52 ms per token, 60.52 tokens per second)

This was obtained through the following command: ./llama-server --host 0.0.0.0 --port 8999 -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf -ngl 99 -md /models/Qwen2.5.1-Coder-1.5B-Instruct-IQ4_XS.gguf -ngld 99 --top-k 1 --temp 0.0 --draft-p-min 0 --draft-max 15

Over 60 tokens/second on a single 7900XTX! What a time to be alive :) Thank you so much for all your hard work @ggerganov ! Still very curious why I need different settings between llama-speculative and llama-server, but at least I am extremely happy I was able to fully unlock the potential of my 7900XTX

@ggerganov
Owner Author

@Mushoz llama-speculative is a different implementation of speculative decoding which is mainly used for experimentation and research. In some cases it will be more efficient, in other cases less so. The llama-server support added here is a conservative approach (greedy speculation) which primarily aims to not make performance worse than without SD, across a larger variety of hardware, not just CUDA. With time, it will be improved.

Also, you cannot really evaluate the SD gains based on individual prompts like this: the gains vary widely with the specific input. And using 3rd-party clients introduces unknowns that we cannot analyze. The only expected result is that if you follow the instructions in the OP, performance will not be worse compared to SD-less inference.
