The GPTQ implementation in vLLM is known to produce invalid results for the 32B-GPTQ-Int4 model on sequences with fewer than 50 tokens and certain prompts (the default system prompt appears to be one of them).
In that case, the GPTQ implementation takes the fast path instead of the reconstruct path (dequantize, then matmul). The issue appears to be related to numerical stability in the fast path, but this is unexpected: the fast path is essentially the exllama_v2 implementation, which auto_gptq also builds upon, and auto_gptq does not have the same problem. There may be deeper issues, which we are still looking into together with the vLLM team.
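To make the two paths concrete, here is a toy sketch of the reconstruct path (an illustration under assumed shapes and per-column parameters, not vLLM's actual kernel, which uses packed int32 weights and per-group parameters): dequantize the 4-bit weight codes, then run an ordinary matmul. The fast path fuses these steps into a single exllama_v2-style CUDA kernel, which is where the suspected numerical issue lives.

```python
import torch

def reconstruct_matmul(x, q, scale, zero):
    # q holds 4-bit weight codes in [0, 15]; scale/zero are per-output-column
    # quantization parameters here for simplicity
    w = (q.float() - zero) * scale  # dequantize to full precision
    return x @ w                    # plain matmul on the dequantized weights

x = torch.randn(2, 8)
q = torch.randint(0, 16, (8, 4))
scale = torch.full((4,), 0.1)
zero = torch.full((4,), 8.0)
print(reconstruct_matmul(x, q, scale, zero).shape)  # torch.Size([2, 4])
```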
We currently know of two workarounds:

1. Use gptq_marlin, which is available on Ampere and later GPUs (see the sketch after this list).
2. Change the number on this line from 50 to 0 and install vLLM from the modified source. This may reduce speed on short sequences, though.
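For the first workaround, a minimal sketch of how one might force the marlin kernel in vLLM (the model name is from this issue; other settings are assumptions):

```python
from vllm import LLM

# Ask vLLM for the gptq_marlin kernel instead of the default gptq kernel.
# Requires an Ampere (SM 8.0) or newer GPU.
llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4",
    quantization="gptq_marlin",
)
```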
Model Series
Qwen2.5
What are the models used?
Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4
What is the scenario where the problem happened?
vLLM
Is this a known issue?
Information about environment
Log output
The request returns normally, but the returned result is all "!!!!!".
Description
I'm pasting my code here. Inference directly through huggingface returns normal output; this setup also returns, but the output is all "!!!". I'd like to know why, or where my code is wrong.
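Roughly, the call looks like this (a simplified sketch, not the exact code from the report, with an assumed prompt and sampling settings):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4")
params = SamplingParams(temperature=0.7, max_tokens=128)
out = llm.generate(["Give me a short introduction to large language models."], params)
print(out[0].outputs[0].text)  # reportedly comes back as "!!!!!" with the gptq kernel
```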