
[Bug]: Qwen2.5-32b-int4 run with vLLM seems to only generate exclamation marks #1103

Open
ciaoyizhen opened this issue Nov 26, 2024 · 4 comments
Labels
duplicate (This issue or pull request already exists), help wanted (Extra attention is needed)

Comments

ciaoyizhen commented Nov 26, 2024

Model Series

Qwen2.5

What are the models used?

Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4

What is the scenario where the problem happened?

vllm

Is this a known issue?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find an answer there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

OS: CentOS 7.9
Python: 3.12
GPUs: 4 x NVIDIA V100
NVIDIA driver: 545.23.06
CUDA compiler: not found
PyTorch: 2.3.0
vLLM: 0.5.1

Log output

The request returns normally,
but the returned result is entirely "!!!!!"

Description

I'm pasting the code here; inference run directly with Hugging Face currently works fine.

from transformers import AutoTokenizer
from vllm import LLM

model_path = "Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4"
tokenizer = AutoTokenizer.from_pretrained(model_path)
llm = LLM(model_path)

messages = [
    {
        "role": "user",
        "content": "你好"
    }
]

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
for output in llm.generate(input_text):
    print(output)

Running inference directly with Hugging Face returns normal output. This also returns something, but it is all "!!!". I'd like to know why, or where I went wrong.
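
For reference, a minimal sketch of the Hugging Face path being compared against; this is the standard transformers generate pattern rather than the exact script used here, and the max_new_tokens value is an arbitrary choice:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

messages = [
    {
        "role": "user",
        "content": "你好"
    }
]

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# generate, then decode only the newly produced tokens
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))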

ciaoyizhen (Author) commented:

Models that are not Int-quantized work fine; for example, the 3B I run works normally.

jklj077 added the duplicate and help wanted labels on Nov 26, 2024
YorkSu commented Nov 29, 2024

It looks like your vLLM is out of date. Try upgrading to vllm==0.6.4.post1 and generating again?

jklj077 (Collaborator) commented Nov 29, 2024

The GPTQ implementation in vLLM is known to produce invalid results for the 32B-GPTQ-Int4 model on sequences with fewer than 50 tokens and with certain prompts (the default system prompt appears to be one of them).

In that case, the GPTQ implementation uses the fast path instead of the reconstruction path (dequantize, then matmul). The issue appears to be related to numerical stability in the fast path, which is unexpected because the fast path is essentially the exllama_v2 implementation that AutoGPTQ also builds upon, and AutoGPTQ does not have the same problem. There may be deeper issues, which we are still looking into together with the vLLM team.

We have currently found two workarounds:

  • Use gptq_marlin, which is available on Ampere and later cards (see the sketch after this list).
  • Change the number on this line from 50 to 0 and install from the modified source code. It may affect speed on short sequences, though.
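
For the first workaround, here is a minimal sketch of forcing the backend when constructing the engine, assuming your vLLM version accepts the gptq_marlin quantization name (note it needs an Ampere or newer GPU, so it will not help on V100):

from vllm import LLM, SamplingParams

# Force the Marlin GPTQ kernels instead of the default GPTQ path.
# Requires an Ampere (SM 8.0) or newer GPU.
llm = LLM(model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4", quantization="gptq_marlin")

outputs = llm.generate(["你好"], SamplingParams(max_tokens=128))
for output in outputs:
    print(output.outputs[0].text)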
