Bug: Heavy throttling during token generation on Apple Silicon #10444
I have coordinated with an Apple Senior Specialist in an attempt to resolve this issue. Under their advice, I have tested this throttling under many different conditions (such as with a brand-new OS with nothing installed), but the issue remained. I have also provided them with screen recordings of when the throttling occurred, along with detailed diagnostics obtained with Apple's proprietary Capture Data tool. With this data, they concluded that there was no hardware issue with my device and that there was no overheating during the tests.
I see the same throttling with my M3 Max (same config as yours), but there is not much we can do about that. llama.cpp does not keep any state between runs; the issue is entirely within the OS or hardware.
It's probably something related to the M3, because I can't reproduce it on either a MacBook M1 Pro or a Mac Studio M2 Ultra:

```sh
./llama-bench -m models/qwen2.5-32b-coder-instruct/ggml-model-q4_k.gguf -mmp 0 -fa 1 -p 0 \
    -n 32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32
```

MacBook M1 Pro

build: 3ee6382 (4132)

M2 Ultra

build: 1bb30bf (4149)
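As an aside, the long `-n` list above simply repeats the tg32 test many times in a single invocation; a sketch of generating that argument programmatically rather than typing it out (assumes bash; the repeat count of 100 is illustrative, not the exact count used above):

```sh
# Build a "32,32,...,32" argument with 100 repeats, then strip the trailing comma
n_list=$(printf '32,%.0s' {1..100})
./llama-bench -m models/qwen2.5-32b-coder-instruct/ggml-model-q4_k.gguf \
    -mmp 0 -fa 1 -p 0 -n "${n_list%,}"
```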
It does not seem to reproduce on an M4 Mac Mini either.
@ggerganov Can you reproduce it with the exact model and steps (with `sleep 10`)?

```sh
./llama-gguf-split --merge qwen2.5-72b-instruct-q4_0-00001-of-00011.gguf qwen2.5-72b-instruct-q4_0.gguf
```
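The repeated-run steps themselves were not preserved above, but judging by the trace further down, a `test.sh` along these lines is presumably what was meant (a sketch only; the iteration count of 22 is inferred from the output, and `set -x` reproduces the `+ ...` trace lines):

```sh
#!/usr/bin/env bash
set -x  # echo each command as it runs, producing the '+ ...' trace below

# Run the tg32 benchmark repeatedly, pausing 10 s between runs
for _ in $(seq 1 22); do
    ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf \
        -mmp 0 -fa 1 -p 0 -n 32
    sleep 10
done
```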
Here is the M2 Max behavior. I cannot run that Qwen 72B model because my M2 Max has 32 GB DDR.
You can check the actual frequencies and power while it runs. As slaren already explained, there is not much we can do.
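On macOS, `powermetrics` is the usual built-in way to watch this live (an assumption on my part; the exact command referenced above was not preserved):

```sh
# Sample CPU and GPU frequency/power once per second during generation
# (requires sudo; Ctrl-C to stop)
sudo powermetrics --samplers cpu_power,gpu_power -i 1000
```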
@Azirine Here are the same steps as yours, using `test.sh`:

```
$ bash test.sh
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.90 ± 0.04 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.96 ± 0.01 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.96 ± 0.01 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.95 ± 0.01 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.97 ± 0.00 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.96 ± 0.01 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.98 ± 0.01 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.97 ± 0.00 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.97 ± 0.00 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.98 ± 0.00 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.97 ± 0.01 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 15.00 ± 0.03 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.97 ± 0.01 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.97 ± 0.01 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.99 ± 0.00 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.99 ± 0.01 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.99 ± 0.01 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.99 ± 0.00 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.99 ± 0.00 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.98 ± 0.00 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.99 ± 0.01 |
build: 6dfcfef0 (4153)
+ sleep 10
+ ./llama-bench -m ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.39 GiB | 72.71 B | Metal,BLAS | 16 | 1 | 0 | tg32 | 14.99 ± 0.01 |
build: 6dfcfef0 (4153)
sha256sum ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf
6ad1ee4cd0330387434608b20dd0ebd26bc9a9355abb9042166d587ef6e17538 ./models/qwen2.5-72b-instruct/ggml-model-q4_0.gguf
```
What happened?
There is heavy throttling during token generation on Apple Silicon. The machine tested is a 14" MacBook Pro with an M3 Max and 128 GB of memory. In my experience, throttling occurs more often with larger models (≥70B). Qwen 72B Q4_0 GGUF was used in this case, although the throttling does not happen exclusively with this model.
The tests were performed in high-power mode with the original 96W adapter plugged in, to ensure that the machine was not power-limited. The max core temperature during throttling (middle of the 4th run in this case) hovered between 60-70°C, meaning the throttling should not be due to thermal limitations. I have experienced this issue for months across many different versions of llama.cpp, so it is not version-specific.
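For anyone trying to reproduce this, macOS's thermal-pressure state can be watched alongside the temperatures (a suggested check, not part of the original report; assumes the built-in `powermetrics` tool):

```sh
# Print the current thermal pressure level (Nominal/Moderate/Heavy) every 2 s
sudo powermetrics --samplers thermal -i 2000
```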
Name and Version
version: 4104 (0fff7fd)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin24.1.0
What operating system are you seeing the problem on?
Mac
Relevant log output