
[Bug]: Qwen2.5-32b-int4 run with vLLM seems to only generate exclamation marks #1103

Open
ciaoyizhen opened this issue Nov 26, 2024 · 4 comments
Labels
duplicate (This issue or pull request already exists), help wanted (Extra attention is needed)

Comments

ciaoyizhen commented Nov 26, 2024

Model Series

Qwen2.5

What are the models used?

Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4

What is the scenario where the problem happened?

vllm

Is this a known issue?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find an answer there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

OS: CentOS 7.9
Python: 3.12
GPUs: 4 x NVIDIA V100
NVIDIA driver: 545.23.06
CUDA compiler: not found
PyTorch: 2.3.0
vLLM: 0.5.1

Log output

The request returns normally,
but the returned result is entirely "!!!!!"

Description

I'm pasting the code here; inference run directly with Hugging Face currently works fine.

from transformers import AutoTokenizer
from vllm import LLM

model_path = "Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4"
tokenizer = AutoTokenizer.from_pretrained(model_path)
llm = LLM(model_path)

messages = [
    {
        "role": "user",
        "content": "你好"
    }
]

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
for output in llm.generate(input_text):
    print(output)

Running inference directly with Hugging Face returns normal output. This also returns something, but it is all "!!!". I'd like to know why, or where I went wrong.
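
For reference, a minimal sketch of the Hugging Face path being compared against; this is the standard transformers generate pattern rather than the exact script used here, and the max_new_tokens value is an arbitrary choice:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

messages = [
    {
        "role": "user",
        "content": "你好"
    }
]

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# generate, then decode only the newly produced tokens
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))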

ciaoyizhen (Author) commented:

Models that are not Int-quantized work fine; for example, the 3B I run works normally.

jklj077 added the duplicate and help wanted labels on Nov 26, 2024
YorkSu commented Nov 29, 2024

It looks like your vLLM is out of date. Try upgrading to vllm==0.6.4.post1 and generating again?

jklj077 (Collaborator) commented Nov 29, 2024

The GPTQ implementation in vLLM is known to produce invalid results for the 32B-GPTQ-Int4 model on sequences with fewer than 50 tokens and with certain prompts (the default system prompt appears to be one of them).

In that case, the GPTQ implementation uses the fast path instead of the reconstruction path (dequantize, then matmul). The issue appears to be related to numerical stability in the fast path, which is unexpected because the fast path is essentially the exllama_v2 implementation that AutoGPTQ also builds upon, and AutoGPTQ does not have the same problem. There may be deeper issues, which we are still looking into together with the vLLM team.

We have currently found two workarounds:

  • Use gptq_marlin, which is available on Ampere and later cards (see the sketch after this list).
  • Change the number on this line from 50 to 0 and install from the modified source code. It may affect speed on short sequences, though.
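
For the first workaround, here is a minimal sketch of forcing the backend when constructing the engine, assuming your vLLM version accepts the gptq_marlin quantization name (note it needs an Ampere or newer GPU, so it will not help on V100):

from vllm import LLM, SamplingParams

# Force the Marlin GPTQ kernels instead of the default GPTQ path.
# Requires an Ampere (SM 8.0) or newer GPU.
llm = LLM(model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4", quantization="gptq_marlin")

outputs = llm.generate(["你好"], SamplingParams(max_tokens=128))
for output in outputs:
    print(output.outputs[0].text)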
