Missing Out Variants When Running Llama3.2 Example Without XNNPack #6975

Open
sheetalarkadam opened this issue Nov 20, 2024 · 3 comments

Comments

@sheetalarkadam

sheetalarkadam commented Nov 20, 2024

I am following the instructions in the Llama2 README to test a Llama model with ExecuTorch.
I want to compare the performance of the model with and without XNNPack. From the code, it seems that DQLinear operations are delegated to XNNPack by default. However, I would like to understand how to use the quantized ops defined in ExecuTorch, as listed in quantized.yaml. Could you provide guidance on configuring the model to use ExecuTorch's quantized ops instead of XNNPack?

I encounter the following error when the -X (--xnnpack) flag is removed from the Python export command:
raise RuntimeError(f"Missing out variants: {missing_out_vars}")
RuntimeError: Missing out variants: {'quantized_decomposed::choose_qparams_per_token_asymmetric', 'quantized_decomposed::dequantize_per_channel', 'quantized_decomposed::dequantize_per_channel_group', 'quantized_decomposed::dequantize_per_token', 'quantized_decomposed::quantize_per_token'}

LLAMA_QUANTIZED_CHECKPOINT=/content/SpinQuant_workspace/consolidated.00.pth
LLAMA_PARAMS=/src/gitrepo/llama/Llama3.2-1B/params.json
python -m examples.models.llama2.export_llama \
   --checkpoint "${LLAMA_QUANTIZED_CHECKPOINT:?}" \
   --params "${LLAMA_PARAMS:?}" \
   --use_sdpa_with_kv_cache \
   --preq_mode 8da4w_output_8da8w \
   --preq_group_size 32 \
   --max_seq_length 2048 \
   --output_name "llama3_2_noxnn.pte" \
   -kv \
   -d fp32 \
   --preq_embedding_quantize 8,0 \
   --use_spin_quant native \
   --metadata '{"append_eos_to_prompt": 0, "get_bos_id":128000, "get_eos_ids":[128009, 128001], "get_n_bos": 0, "get_n_eos": 0}'

What adjustments are required to resolve the "missing out variants" error when the -X flag is omitted?
Thank you for your assistance!

Versions

Collecting environment information...
PyTorch version: 2.6.0.dev20240927+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.31.0
Libc version: glibc-2.35

Python version: 3.10.0 (default, Mar 3 2022, 09:58:08) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.15.167.1-1.cm2-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: 12.6.77

Versions of relevant libraries:
[pip3] executorch==0.5.0a0+20a157f
[pip3] numpy==1.26.4
[pip3] torch==2.6.0.dev20240927+cpu
[pip3] torchao==0.5.0+git0916b5b2
[pip3] torchaudio==2.5.0.dev20240927+cpu
[pip3] torchsr==1.0.4
[pip3] torchvision==0.20.0.dev20240927+cpu
[conda] executorch 0.5.0a0+20a157f pypi_0 pypi
[conda] numpy 1.26.4 pypi_0 pypi
[conda] torch 2.6.0.dev20240927+cpu pypi_0 pypi
[conda] torchaudio 2.5.0.dev20240927+cpu pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchvision 0.20.0.dev20240927+cpu pypi_0 pypi

@metascroy
Contributor

Can you try adding import executorch.kernels.quantized to export_llama.py, like this:

import executorch.kernels.quantized # noqa[F401] 'executorch.kernels.quantized' imported but unused
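
A minimal sketch of guarding that import so a failure shows up at export time instead of later as a missing-op error. The helper name `load_op_library` is hypothetical and not part of ExecuTorch; only the module path `executorch.kernels.quantized` comes from this thread, and the idea that importing it registers the quantized out variants as a side effect is inferred from the suggestion rather than confirmed here.

```python
import importlib


def load_op_library(module_name: str) -> bool:
    """Import a module for its side effects and report whether it loaded.

    Hypothetical helper for illustration; not an ExecuTorch API.
    """
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        # Surface the failure early rather than hitting a missing op later.
        print(f"could not import {module_name}: {exc}")
        return False
    return True
```

Calling `load_op_library("executorch.kernels.quantized")` near the top of the export script would then print a message if the package is not importable in the current environment.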

I don't think we have a quantized linear kernel in ExecuTorch outside of XNNPACK or torchao, so I guess those ops probably dequantize the weights and do the linear computation in float32, which might not make for a good comparison.
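
To make the "dequantize, then compute in float" path concrete, here is a pure-Python sketch. It assumes the usual affine scheme, w = (q - zero_point) * scale, applied per group of columns within each output channel; the function names mirror the op names in the error message but are illustrative only, and this is not the ExecuTorch kernel.

```python
from typing import List


def dequantize_per_channel_group(
    q_weight: List[List[int]],
    scales: List[List[float]],
    zero_points: List[List[int]],
    group_size: int,
) -> List[List[float]]:
    """Map integer weights back to floats: w = (q - zero_point) * scale.

    Each row is one output channel, split into groups of `group_size`
    columns; scales[r][g] and zero_points[r][g] belong to group g of row r.
    """
    out = []
    for r, row in enumerate(q_weight):
        if len(row) % group_size != 0:
            raise ValueError("row length must be a multiple of group_size")
        out.append([
            (q - zero_points[r][c // group_size]) * scales[r][c // group_size]
            for c, q in enumerate(row)
        ])
    return out


def linear_fp(x: List[float], weight: List[List[float]]) -> List[float]:
    """Plain float matrix-vector product, y = W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# Toy values; group_size=2 for readability (the export command uses 32).
q_w = [[1, -2, 3, 4], [0, 7, -8, 2]]
scales = [[0.5, 0.25], [1.0, 0.5]]
zps = [[0, 0], [0, 0]]
w = dequantize_per_channel_group(q_w, scales, zps, group_size=2)
y = linear_fp([1.0, 1.0, 1.0, 1.0], w)
```

In this path the matrix multiply itself runs entirely on floats, so the quantization only affects how the weights are stored, which is why timing it against a delegated quantized kernel may not be an apples-to-apples comparison.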

cc @larryliu0820 for missing ops and @digantdesai for XNNPACK

@digantdesai
Contributor

Hmm... We should have, for example, quantize_per_token_out in executorch/kernels/quantized/cpu/op_quantize.cpp, and we should link against the quantized_ops_lib. We should also have tests for running quantized Llama with portable ops only, though I'm not sure about Llama 3.2.

@sheetalarkadam
Author

sheetalarkadam commented Nov 26, 2024

@digantdesai the only missing op in executorch/kernels/quantized/cpu is dequantize_per_channel_group. Even after adding import executorch.kernels.quantized I get the same error, but it does find the op dequantize_per_channel. I also see the linkage target_link_options_shared_lib(quantized_ops_lib) in CMakeLists.txt.
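
For tracking which ops are still unresolved between export attempts, a small sketch that parses the error text and diffs it against a set of ops known to be found. The error string is the one from the original report; the `resolved` set is a hypothetical placeholder for whatever a later run shows, and `parse_missing_out_variants` is an illustrative name rather than an ExecuTorch function.

```python
import re
from typing import Set


def parse_missing_out_variants(message: str) -> Set[str]:
    """Extract op names from a 'Missing out variants: {...}' error message."""
    match = re.search(r"Missing out variants: \{(.*?)\}", message)
    if match is None:
        return set()
    return set(re.findall(r"'([^']+)'", match.group(1)))


# Error text copied from the original report.
error = (
    "Missing out variants: {'quantized_decomposed::choose_qparams_per_token_asymmetric', "
    "'quantized_decomposed::dequantize_per_channel', "
    "'quantized_decomposed::dequantize_per_channel_group', "
    "'quantized_decomposed::dequantize_per_token', "
    "'quantized_decomposed::quantize_per_token'}"
)
before = parse_missing_out_variants(error)

# Hypothetical: ops that a later run reports as found.
resolved = {"quantized_decomposed::dequantize_per_channel"}
for op in sorted(before - resolved):
    print(op)
```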

@metascroy To try the torchao ops, I am currently using the main branch, but I'm hitting some minor issues, such as quantization args not getting passed to ModelArgs.
