
Replace WeightOnlyInt8Linear with TorchAO int8_weight_only quantization #1328

Open
wants to merge 2 commits into base: main
Conversation

@vmpuri (Contributor) commented Oct 24, 2024

Replace the WeightOnlyInt8Linear quantization code with TorchAO's int8_weight_only quantization.

Note - this commit also contains lintrunner changes.
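
For context, the core of the change is swapping the hand-rolled WeightOnlyInt8Linear module swap for torchao's quantize_ API. A minimal sketch of the new-style call, with a toy model standing in for the loaded torchchat transformer (the model, layer sizes, and device handling below are placeholders, not code from this PR):

import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-in for the loaded torchchat model.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 64)).to(device)

# Rewrites the weights of eligible nn.Linear modules in place as int8 tensor
# subclasses, replacing the previous custom WeightOnlyInt8Linear module swap.
quantize_(model, int8_weight_only())

with torch.no_grad():
    out = model(torch.randn(1, 512, device=device))
print(out.shape)  # torch.Size([1, 64])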

Testing:

python3 torchchat.py eval llama3.2-1b --quantize '{"linear:int8": {"groupsize": 0}, "executor":{"accelerator":"cuda"}}' --compile
Using device=cuda
Loading model...
Time to load model: 1.21 seconds
Quantizing the model with: {'linear:int8': {'groupsize': 0}, 'executor': {'accelerator': 'cuda'}}
quantizer is linear int8
Time to quantize model: 0.31 seconds
-----------------------------------------------------------
2024-10-24:15:55:20,261 INFO     [huggingface.py:162] Using device 'cuda'
2024-10-24:15:55:27,792 WARNING  [task.py:763] [Task: wikitext] metric word_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
2024-10-24:15:55:27,792 WARNING  [task.py:775] [Task: wikitext] metric word_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
2024-10-24:15:55:27,792 WARNING  [task.py:763] [Task: wikitext] metric byte_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
2024-10-24:15:55:27,792 WARNING  [task.py:775] [Task: wikitext] metric byte_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
2024-10-24:15:55:27,792 WARNING  [task.py:763] [Task: wikitext] metric bits_per_byte is defined, but aggregation is not. using default aggregation=bits_per_byte
2024-10-24:15:55:27,792 WARNING  [task.py:775] [Task: wikitext] metric bits_per_byte is defined, but higher_is_better is not. using default higher_is_better=False
Repo card metadata block was not found. Setting CardData to empty.
2024-10-24:15:55:28,687 WARNING  [repocard.py:108] Repo card metadata block was not found. Setting CardData to empty.
2024-10-24:15:55:28,760 INFO     [task.py:395] Building contexts for wikitext on rank 0...
100%|███████████████████████████████████████████████████████████████████████████████████████████| 62/62 [00:00<00:00, 501.80it/s]
2024-10-24:15:55:28,889 INFO     [evaluator.py:362] Running loglikelihood_rolling requests
100%|████████████████████████████████████████████████████████████████████████████████████████████| 62/62 [01:10<00:00,  1.13s/it]
Time to run eval: 78.96s.
Time in model.forward: 62.57s, over 162 model evaluations
forward run time stats - Median: 0.00s Min: 0.00s Max: 41.80s
For model /home/puri/.torchchat/model-cache/meta-llama/Meta-Llama-3.2-1B-Instruct/model.pth
wikitext:
 word_perplexity,none: 19.2032
 byte_perplexity,none: 1.7378
 bits_per_byte,none: 0.7973
 alias: wikitext

From current master:

python3 torchchat.py eval llama3.2-1b --quantize '{"linear:int8": {"groupsize": 0}, "executor":{"accelerator":"cuda"}}' --compile
Using device=cuda
Loading model...
Time to load model: 1.20 seconds
Quantizing the model with: {'linear:int8': {'groupsize': 0}, 'executor': {'accelerator': 'cuda'}}
Time to quantize model: 0.19 seconds
-----------------------------------------------------------
2024-10-24:15:43:59,945 INFO     [huggingface.py:162] Using device 'cuda'
2024-10-24:15:44:07,664 WARNING  [task.py:763] [Task: wikitext] metric word_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
2024-10-24:15:44:07,664 WARNING  [task.py:775] [Task: wikitext] metric word_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
2024-10-24:15:44:07,664 WARNING  [task.py:763] [Task: wikitext] metric byte_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
2024-10-24:15:44:07,664 WARNING  [task.py:775] [Task: wikitext] metric byte_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
2024-10-24:15:44:07,664 WARNING  [task.py:763] [Task: wikitext] metric bits_per_byte is defined, but aggregation is not. using default aggregation=bits_per_byte
2024-10-24:15:44:07,664 WARNING  [task.py:775] [Task: wikitext] metric bits_per_byte is defined, but higher_is_better is not. using default higher_is_better=False
Repo card metadata block was not found. Setting CardData to empty.
2024-10-24:15:44:09,261 WARNING  [repocard.py:108] Repo card metadata block was not found. Setting CardData to empty.
2024-10-24:15:44:09,342 INFO     [task.py:395] Building contexts for wikitext on rank 0...
100%|████████████████████████████████████████████████████████████████████████████████████████████| 62/62 [00:00<00:00, 463.50it/s]
2024-10-24:15:44:09,482 INFO     [evaluator.py:362] Running loglikelihood_rolling requests
100%|█████████████████████████████████████████████████████████████████████████████████████████████| 62/62 [01:00<00:00,  1.03it/s]
Time to run eval: 70.16s.
Time in model.forward: 53.46s, over 162 model evaluations
forward run time stats - Median: 0.00s Min: 0.00s Max: 33.02s
For model /home/puri/.torchchat/model-cache/meta-llama/Meta-Llama-3.2-1B-Instruct/model.pth
wikitext:
 word_perplexity,none: 19.2432
 byte_perplexity,none: 1.7385
 bits_per_byte,none: 0.7978
 alias: wikitext

Lint

pip install -r install/requirements-lintrunner.txt 
lintrunner -a

pytorch-bot (bot) commented Oct 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1328

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1a42fb6 with merge base e30aaa0:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 24, 2024
@vmpuri vmpuri marked this pull request as ready for review October 24, 2024 22:57
@jerryzh168 (Contributor) commented Oct 24, 2024

Thanks! Can you add a generate.py speed benchmark result for before and after as well?

# Use tensor subclass API for int4 weight only.
if device == "cuda" and quantizer == "linear:int4":
    quantize_(model, int4_weight_only(q_kwargs["groupsize"]))
elif quantizer == "linear:int8":
    print("quantizer is linear int8")
Contributor:
Suggested change (delete this line):
-    print("quantizer is linear int8")
"precision": PrecisionHandler,
"executor": ExecutorHandler,
"linear:int4": Int4WeightOnlyQuantizer,
"linear:int8": int8_weight_only,
Contributor:
Do we need this?

Contributor:
we can probably use None for now, and remove this later

Contributor:
I think we check for int8_weight_only and finish that check before it ever looks at the table.

@vmpuri can you check?

@Jack-Khuu (Contributor):
Can you ack that the numerics look good for MPS and CPU as well?

# Use tensor subclass API for int4 weight only.
if device == "cuda" and quantizer == "linear:int4":
    quantize_(model, int4_weight_only(q_kwargs["groupsize"]))
elif quantizer == "linear:int8":
    print("quantizer is linear int8")
    quantize_(model, int8_weight_only())
Contributor:
Why not integrate it into a QuantHandler class dispatched through the handler dict at a single call site, rather than building a chain of if statements?
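
A rough sketch of what that could look like, purely for illustration (the handler name, constructor signature, and quantized_model method below are hypothetical, not torchchat's actual QuantHandler interface, and nothing like this is in the PR):

from torchao.quantization import quantize_, int8_weight_only

# Hypothetical wrapper that adapts torchao's functional API to a class-based
# handler, so it can be dispatched through the handler dict at one call site.
class Int8WeightOnlyHandler:
    def __init__(self, model, device="cpu", precision=None, **kwargs):
        self.model = model

    def quantized_model(self):
        quantize_(self.model, int8_weight_only())
        return self.model

# Single dispatch table instead of a chain of if/elif branches.
handlers = {
    "linear:int8": Int8WeightOnlyHandler,
    # "linear:int4": ..., "embedding": ..., etc.
}

def quantize_model(model, quantizer_name, **q_kwargs):
    return handlers[quantizer_name](model, **q_kwargs).quantized_model()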

Contributor:
Hi @mikekgfb, I think we will refactor this part in the future, after all quant APIs are moved to torchao.

Contributor:
torchAO already has a class-based API that is used for other quantizers? Why do these differently, and then later refactor them? Or why not do them all a consistent way now, and if you refactor later, do that?

Contributor:
Yeah, the quantizer API is deprecated in favor of quantize_; that's why we are gradually refactoring the quantizer APIs to use quantize_. The reason we do it one by one is that there might be missing support or numerics alignment that we need to address during the migration.

        return linear_int8_aoti(input, self.weight, self.scales)

    def et_forward(self, input: torch.Tensor) -> torch.Tensor:
        return linear_int8_et(input, self.weight, self.scales)
Contributor:
Int8 seems like it is special-cased for ET; reminder to check that as well.
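
For reference, both forward paths of the module being replaced compute a plain weight-only int8 linear: cast the int8 weight up to the activation dtype, run the matmul, then apply the per-output-channel scales. A rough sketch of that math (not the exact torchchat, AOTI, or ET kernels):

import torch
import torch.nn.functional as F

def int8_weight_only_linear(x, weight_int8, scales):
    # Dequantize on the fly, then scale each output channel.
    return F.linear(x, weight_int8.to(dtype=x.dtype)) * scales

x = torch.randn(2, 8)
w = torch.randint(-128, 128, (4, 8), dtype=torch.int8)  # [out_features, in_features]
s = torch.rand(4)                                        # one scale per output channel
print(int8_weight_only_linear(x, w, s).shape)            # torch.Size([2, 4])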
