Bug: flash-attn can't be used #10378
Labels
bug-unconfirmed
low severity
What happened?
I want to quantize the KV cache as q8_0, but the following error occurs:
llama_new_context_with_model: V cache quantization requires flash_attn
common_init_from_params: failed to create context with model '/home/albert/work/code/models/chatglm4-9B.guff'
main: error: unable to load model
After installing the flash-attn Python package, this error still occurs.
How can I deal with this problem?
Name and Version
Command: ./llama-cli -m ~/work/code/models/chatglm4-9B.guff -b 1024 -ctk q8_0 -ctv q8_0 -ngl 256 -p 给我讲个笑话吧
torch version: 2.5.1
CUDA version: 12.4
flash-attn version: 2.7.0.post2
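If the missing piece is llama.cpp's own flash-attention path (enabled at run time with the `-fa` / `--flash-attn` flag of llama-cli) rather than the Python flash-attn package, a sketch of the invocation might look like the following. This is only a sketch, assuming the CUDA build and this model support flash attention; all other options are copied from the command above.

```bash
# Hedged sketch, not a confirmed fix: add -fa so llama.cpp's built-in flash
# attention is enabled, which the error says is required for -ctv q8_0
# (quantized V cache). Model path and options are taken from the report.
./llama-cli -m ~/work/code/models/chatglm4-9B.guff \
  -b 1024 \
  -fa \
  -ctk q8_0 -ctv q8_0 \
  -ngl 256 \
  -p "给我讲个笑话吧"  # "tell me a joke"
```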
What operating system are you seeing the problem on?
No response
Relevant log output
No response