v2.14.0

@KodiaqQ released this on 20 Nov 11:49

Post-training Quantization:

Features:

  • Introduced the optional backup_mode parameter in nncf.compress_weights() to specify the data type for embeddings, convolutions, and last linear layers during 4-bit weight compression. Available options are INT8_ASYM (the default), INT8_SYM, and NONE, which retains the original floating-point precision of the model weights (a usage sketch follows this list).
  • Added the quantizer_propagation_rule parameter, providing fine-grained control over quantizer propagation. This advanced option is designed to improve accuracy for models where quantizers of different granularity would otherwise be merged into a single per-tensor quantizer, potentially degrading model accuracy.
  • Introduced the nncf.data.generate_text_data API method that uses an LLM to generate synthetic text data for further data-aware optimization (a sketch follows this list). See the example for details.
  • (OpenVINO) Extended the data-free and data-aware weight compression methods of nncf.compress_weights() with NF4 per-channel quantization, which makes compressed LLMs more accurate and faster on NPU (a sketch follows this list).
  • (OpenVINO) Introduced a new statistics_path option to cache and reuse statistics for nncf.compress_weights(), reducing the time required to find optimal compression configurations (a sketch follows this list). See the TinyLlama example for details.
  • (TorchFX, Experimental) Added support for quantization and weight compression of Torch FX models. The compressed models can be executed directly via torch.compile(compressed_model, backend="openvino") (see details here, and the capture-quantize-compile sketch after this list). Added an INT8 quantization example. The list of supported features:
    • INT8 quantization with SmoothQuant, MinMax, FastBiasCorrection, and BiasCorrection algorithms via nncf.quantize().
    • Data-free INT8, INT4, and mixed-precision weights compression with nncf.compress_weights().
  • (PyTorch, Experimental) Added model tracing and execution pre- and post-hooks based on TorchFunctionMode.
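
A minimal usage sketch of backup_mode; the IR path, the INT4_SYM mode choice, and the model itself are placeholders, not from the release notes:

```python
import nncf
import openvino as ov

model = ov.Core().read_model("model.xml")  # placeholder IR path

# 4-bit weight compression; embeddings, convolutions, and last linear
# layers follow backup_mode instead of the 4-bit mode.
compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    backup_mode=nncf.BackupMode.NONE,  # keep backup layers in original floating point
)
```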
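
A hedged sketch of synthetic data generation; the TinyLlama model id and the dataset_size argument are assumptions based on the linked example:

```python
import nncf
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder model
hf_model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The LLM itself generates synthetic text samples that can then feed
# data-aware weight compression.
synthetic_data = nncf.data.generate_text_data(
    hf_model, tokenizer, dataset_size=32  # dataset_size is an assumed kwarg
)
```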
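
A data-free NF4 sketch, assuming the usual group_size=-1 convention selects per-channel quantization:

```python
import nncf
import openvino as ov

model = ov.Core().read_model("llm.xml")  # placeholder IR path

# Data-free NF4 weight compression; group_size=-1 requests per-channel scales.
compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.NF4,
    group_size=-1,
)
```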
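
A statistics-caching sketch, assuming statistics_path is passed through nncf.AdvancedCompressionParameters (the exact entry point may differ; see the TinyLlama example). The calibration data below is a placeholder:

```python
import numpy as np
import nncf
import openvino as ov

model = ov.Core().read_model("llm.xml")  # placeholder IR path

# Placeholder calibration data; a real run would use tokenized text samples.
calibration_dataset = nncf.Dataset(
    [{"input_ids": np.zeros((1, 32), dtype=np.int64)}]
)

compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    ratio=0.8,  # mixed precision: 80% of weights in 4 bit
    dataset=calibration_dataset,
    advanced_parameters=nncf.AdvancedCompressionParameters(
        statistics_path="statistics_cache",  # assumed cache dir; reused on later runs
    ),
)
```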
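
A capture-quantize-compile sketch for the TorchFX flow; the torchvision model, example input, and the torch.export.export capture step are assumptions, and running the compiled model requires the openvino package:

```python
import torch
import torchvision.models as models
import nncf

model = models.mobilenet_v2().eval()  # placeholder model
example_input = torch.randn(1, 3, 224, 224)

# Capture the model into Torch FX (the capture API is an assumption).
fx_model = torch.export.export(model, (example_input,)).module()

# INT8 post-training quantization of the FX graph.
calibration_dataset = nncf.Dataset([example_input])
quantized = nncf.quantize(fx_model, calibration_dataset)

# Execute directly through the OpenVINO backend of torch.compile.
compiled = torch.compile(quantized, backend="openvino")
output = compiled(example_input)
```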

Fixes:

  • Resolved an issue with redundant quantizer insertion before elementwise operations, reducing noise introduced by quantization.
  • Fixed a type mismatch issue for nncf.quantize_with_accuracy_control().
  • Fixed BiasCorrection algorithm for specific branching cases.
  • (OpenVINO) Fixed GPTQ weight compression method for Stable Diffusion models.
  • (OpenVINO) Fixed an issue with variational statistics processing for nncf.compress_weights().
  • (PyTorch, ONNX) Aligned the scaled dot-product attention pattern quantization setup with OpenVINO.

Improvements:

  • Reduced peak memory by 30-50% for data-aware nncf.compress_weights() with the AWQ, Scale Estimation, LoRA, and mixed-precision algorithms.
  • Reduced compression time by 10-20% for nncf.compress_weights() with the AWQ algorithm.
  • Aligned ignored-subgraph behavior across different networkx versions.
  • Extended the ignored patterns with the RoPE block for the nncf.ModelType.TRANSFORMER scheme.
  • (OpenVINO) Extended the ignored scope for the nncf.ModelType.TRANSFORMER scheme with the GroupNorm metatype.
  • (ONNX) Extended the SE-block ignored pattern variant for torchvision mobilenet_v3.

Known issues:

  • (ONNX) The nncf.quantize() method can produce inaccurate INT8 results for MobileNet models with the BiasCorrection algorithm (a possible workaround sketch follows).
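
As a hedged workaround sketch (not from the release notes): nncf.quantize() uses FastBiasCorrection by default, so keeping fast_bias_correction=True avoids the affected BiasCorrection path. The model path and input name below are placeholders:

```python
import numpy as np
import nncf
import onnx

model = onnx.load("mobilenet.onnx")  # placeholder path

# Placeholder calibration data; "input" stands in for the real input name.
calibration_dataset = nncf.Dataset(
    [{"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}]
)

# fast_bias_correction=True is the default; shown explicitly to stay on
# the FastBiasCorrection path rather than the affected BiasCorrection one.
quantized = nncf.quantize(model, calibration_dataset, fast_bias_correction=True)
```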

Deprecations/Removals:

  • Migrated from setup.py to pyproject.toml for build and package configuration, in line with the Python packaging standards outlined in PEP 517 and PEP 518. Installation through setup.py no longer works; installation from PyPI and Conda is unaffected.
  • Removed support for Python 3.8.
  • (PyTorch) Deprecated the nncf.torch.create_compressed_model() function.

Requirements:

  • Updated ONNX (1.17.0) and ONNXRuntime (1.19.2) versions.
  • Updated PyTorch (2.5.1) and Torchvision (0.20.1) versions.
  • Updated NumPy (<2.2.0) version support.
  • Updated Ultralytics (8.3.22) version.

Acknowledgements

Thanks for contributions from the OpenVINO developer community:
@rk119
@zina-cs