Bug: flash-attn can't be used #10378
Labels
bug-unconfirmed
low severity
What happened?
I want to quantize the KV cache as q8_0, but the following error occurs:
llama_new_context_with_model: V cache quantization requires flash_attn
common_init_from_params: failed to create context with model '/home/albert/work/code/models/chatglm4-9B.guff'
main: error: unable to load model
After installing the flash-attn Python package, this error still occurs.
How can I deal with this problem?
Name and Version
Command: ./llama-cli -m ~/work/code/models/chatglm4-9B.guff -b 1024 -ctk q8_0 -ctv q8_0 -ngl 256 -p 给我讲个笑话吧
torch version: 2.5.1
CUDA version: 12.4
flash-attn version: 2.7.0.post2
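If the missing piece is llama.cpp's own flash-attention path (enabled at run time with the `-fa` / `--flash-attn` flag of llama-cli) rather than the Python flash-attn package, a sketch of the invocation might look like the following. This is only a sketch, assuming the CUDA build and this model support flash attention; all other options are copied from the command above.

```bash
# Hedged sketch, not a confirmed fix: add -fa so llama.cpp's built-in flash
# attention is enabled, which the error says is required for -ctv q8_0
# (quantized V cache). Model path and options are taken from the report.
./llama-cli -m ~/work/code/models/chatglm4-9B.guff \
  -b 1024 \
  -fa \
  -ctk q8_0 -ctv q8_0 \
  -ngl 256 \
  -p "给我讲个笑话吧"  # "tell me a joke"
```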
What operating system are you seeing the problem on?
No response
Relevant log output
No response