Refactor into package #4

mgoin · 2024-05-10T21:43:57Z

Turn this repo into an installable Python package. There is now a Python interface for quantizing, calibrating, and saving your models as FP8.

Example:

from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = ["auto_fp8 is an easy-to-use model quantization library"]
examples = tokenizer(examples, return_tensors="pt").to("cuda")

quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")

model = AutoFP8ForCausalLM.from_pretrained(
    pretrained_model_dir, quantize_config=quantize_config
)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
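As a rough illustration of what per-tensor FP8 scaling involves, here is a minimal pure-Python sketch. It is not the auto_fp8 implementation: the helper names `per_tensor_scale` and `scale_and_clamp` are hypothetical, and the sketch only scales and clamps values into the E4M3 range (maximum finite value 448) without rounding them onto the FP8 grid. With activation_scheme="dynamic", our understanding is that activation scales of this kind are computed at inference time rather than stored in the checkpoint.

```python
# Sketch only: hypothetical helpers, not part of auto_fp8 or vLLM.
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 E4M3


def per_tensor_scale(values):
    """Return one scale so that the largest magnitude maps to FP8_E4M3_MAX."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def scale_and_clamp(values, scale):
    """Divide by the scale and clamp into the representable E4M3 range."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


weights = [0.02, -1.4, 0.7, 0.35]
scale = per_tensor_scale(weights)
scaled = scale_and_clamp(weights, scale)

# Multiplying back by the scale recovers the original values here,
# because this sketch skips the lossy rounding step.
restored = [v * scale for v in scaled]
```

A real kernel would additionally cast the scaled values to an 8-bit float type, which is where the precision loss comes from.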

Then, load it into vLLM for inference!

pip install vllm==0.4.2

from vllm import LLM

model = LLM("Meta-Llama-3-8B-Instruct-FP8")
# INFO 05-10 18:02:40 model_runner.py:175] Loading model weights took 8.4595 GB

print(model.generate("Once upon a time"))
# [RequestOutput(request_id=0, prompt='Once upon a time', prompt_token_ids=[128000, 12805, 5304, 264, 892], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=' there was a man who fell in love with a woman. The man was so', token_ids=[1070, 574, 264, 893, 889, 11299, 304, 3021, 449, 264, 5333, 13, 578, 893, 574, 779], cumulative_logprob=-21.314169232733548, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715378569.478381, last_token_time=1715378569.478381, first_scheduled_time=1715378569.480648, first_token_time=1715378569.7070432, time_in_queue=0.002267122268676758, finished_time=1715378570.104807), lora_request=None)]
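
The "Loading model weights took 8.4595 GB" line above can be sanity-checked with a back-of-the-envelope estimate. The sketch below uses the layer sizes from the public Meta-Llama-3-8B config and makes two assumptions that this PR does not confirm: that only the linear layers inside the transformer blocks are stored as FP8 (1 byte per weight) while embeddings, lm_head, and norms stay in 16-bit, and that the logged "GB" is really GiB (2**30 bytes). The function name `estimate_gib` is hypothetical.

```python
# Rough size estimate for an FP8 checkpoint of Llama-3-8B.
# ASSUMPTION: embeddings, lm_head, and norms remain 16-bit; everything
# else is 1 byte per weight. Per-tensor scales are ignored as negligible.
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size of the MLP
LAYERS = 32            # num_hidden_layers
KV_DIM = 8 * 128       # 8 key/value heads x head_dim 128
VOCAB = 128256         # vocab_size


def estimate_gib():
    attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q, o, k, v projections
    mlp = 3 * HIDDEN * INTERMEDIATE                   # gate, up, down projections
    fp8_params = LAYERS * (attn + mlp)                # stored at 1 byte each
    norms = LAYERS * 2 * HIDDEN + HIDDEN              # two norms per layer + final norm
    bf16_params = 2 * VOCAB * HIDDEN + norms          # embeddings + lm_head + norms
    total_bytes = fp8_params + 2 * bf16_params
    return total_bytes / 2**30


print(f"{estimate_gib():.2f} GiB")  # prints "8.46 GiB"
```

Under those assumptions the estimate lands within a few megabytes of the logged figure, which is consistent with the block weights being loaded as FP8.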

mgoin added 8 commits May 10, 2024 17:34

Update quantize.py

424a4c9

Big refactor into package

4e6ed88

Add gha

b6c1005

Cleanup

3b16e08

Last fixes

c062b1f

Update

935dd70

.

332d98a

.

e7ff7bf

mgoin merged commit a111911 into main May 10, 2024
4 checks passed
