[Bug]: Qwen2.5-32B-GPTQ-Int4 inference !!!!! #10656

Open
jklj077 opened this issue Nov 26, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@jklj077

jklj077 commented Nov 26, 2024

Your current environment

The output of `python collect_env.py`

N/A; happened to multiple users.

Model Input Dumps

No response

🐛 Describe the bug

We have been receiving reports that the 4-bit GPTQ version of Qwen2.5-32B-Instruct cannot be used with vLLM: the generated output contains only `!!!!!`. However, it was also reported that the same model works with transformers and auto_gptq.

Here are some related issues:

We attempted to reproduce the issue, which appears related to quantization kernels, and the following is a summary:

  • gptq_marlin works
  • gptq fails for requests with len(prompt_token_ids)<=50 but works for longer input sequences

The results are consistent for

  • tensor-parallel-size: 2, 4, 8
  • vllm versions: v0.6.1.post2, v0.6.2, v0.6.3.post1, v0.6.4.post1
  • nvidia driver versions: 535.183.06, 560.35.05

As gptq_marlin is not available for Turing and Volta cards, we have not been able to find a workaround for those users. It would help a lot if someone could investigate the issue.
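
For anyone trying to reproduce this locally, here is a minimal sketch using the offline `LLM` API (the model id, prompt, and tensor-parallel size are assumptions on my part; the original reports came from several different serving setups):

```python
# Minimal reproduction sketch (assumptions: model id, prompt, TP size).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4",
    quantization="gptq",        # force the plain gptq kernel instead of gptq_marlin
    tensor_parallel_size=2,
)

# A short prompt (<= ~50 prompt tokens) is what reportedly triggers the "!!!!!"
# output with the gptq kernel; longer prompts reportedly generate normally.
params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["Hello, who are you?"], params)
print(outputs[0].outputs[0].text)
```

Omitting `quantization="gptq"` should let vLLM select gptq_marlin on Ampere or newer GPUs, which is the path reported to work; on Turing and Volta the plain gptq kernel is the only option, hence the lack of a workaround there.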

jklj077 added the bug label on Nov 26, 2024
@youkaichao
Member

cc @robertgshaw2-neuralmagic

@youqugit

I encountered the same issue, but only on the /chat/completions endpoint, which returns output consisting of many `!!!!!`, while the /completions endpoint works fine.

vLLM version: 0.6.1
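
A quick way to compare the two endpoints against a running OpenAI-compatible server (the host, port, and served model name below are assumptions); if the chat template plus a brief user message tokenizes to only a few dozen tokens, this would be consistent with the short-prompt failure described above:

```python
# Sketch: query both endpoints of a running vLLM OpenAI-compatible server.
import requests

base = "http://localhost:8000/v1"                 # assumed server address
model = "Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4"     # assumed served model name

# /chat/completions -- reportedly returns many "!!!!!"
chat = requests.post(f"{base}/chat/completions", json={
    "model": model,
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32,
}).json()
print("chat:", chat["choices"][0]["message"]["content"])

# /completions -- reportedly works fine
comp = requests.post(f"{base}/completions", json={
    "model": model,
    "prompt": "Hello",
    "max_tokens": 32,
}).json()
print("completion:", comp["choices"][0]["text"])
```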

@DarkLight1337
Member

Also cc @mgoin

@mgoin
Collaborator

mgoin commented Nov 26, 2024

As far as I can tell, the gptq kernel hasn't been touched all year; the last change was #2330 by @chu-tianxiang.

This may be a fundamental issue with the kernel for this model; someone would need to dive in and learn about it.
