The Intel® Neural Compressor (formerly known as Intel® Low Precision Optimization Tool) is an open-source Python library that delivers a unified low-precision inference interface across multiple Intel-optimized Deep Learning (DL) frameworks on both CPUs and GPUs. It supports automatic accuracy-driven tuning strategies, along with additional objectives such as optimizing for performance, model size, and memory footprint. It also provides easy extension capability for new backends, tuning strategies, metrics, and objectives.
Note: GPU support is under development.
Visit the Intel® Neural Compressor online documentation website at: https://intel.github.io/neural-compressor/.
Intel® Neural Compressor features an infrastructure and workflow that help increase performance and speed up deployment across architectures.
Supported Intel-optimized DL frameworks are:
- TensorFlow*, including 1.15.0 UP3, 1.15.0 UP2, 1.15.0 UP1, 2.1.0, 2.2.0, 2.3.0, 2.4.0, 2.5.0, Official TensorFlow 2.6.0
Note: Intel Optimized TensorFlow 2.5.0 requires setting the environment variable TF_ENABLE_MKL_NATIVE_FORMAT=0 before running Intel® Neural Compressor quantization or deploying the quantized model.
Note: Starting with official TensorFlow 2.6.0, oneDNN support has been upstreamed. Users only need to download the official TensorFlow binary for the CPU device and set the environment variable TF_ENABLE_ONEDNN_OPTS=1 before running Intel® Neural Compressor quantization or deploying the quantized model.
- PyTorch*, including 1.5.0+cpu, 1.6.0+cpu, 1.8.0+cpu
- Apache* MXNet, including 1.6.0, 1.7.0, 1.8.0
- ONNX* Runtime, including 1.6.0, 1.7.0, 1.8.0
Select the installation based on your operating system.
You can install Intel® Neural Compressor in one of three ways: install only the Intel® Neural Compressor library from a binary package, build it from source, or get the Intel-optimized frameworks together with the Intel® Neural Compressor library by installing the Intel® oneAPI AI Analytics Toolkit.
Install from binary:

```shell
# install stable version from pip
pip install lpot
# install nightly version from pip
pip install -i https://test.pypi.org/simple/ lpot
# install stable version from conda
conda install lpot -c conda-forge -c intel
```

Install from source:

```shell
git clone https://github.com/intel/neural-compressor.git
cd neural-compressor
pip install -r requirements.txt
python setup.py install
```
The Intel® Neural Compressor library is released as part of the Intel® oneAPI AI Analytics Toolkit (AI Kit). The AI Kit provides a consolidated package of Intel's latest deep learning and machine learning optimizations all in one place for ease of development. Along with Intel® Neural Compressor, the AI Kit includes Intel-optimized versions of deep learning frameworks (such as TensorFlow and PyTorch) and high-performing Python libraries to streamline end-to-end data science and AI workflows on Intel architectures.
The AI Kit is distributed through many common channels, including from Intel's website, YUM, APT, Anaconda, and more. Select and download the AI Kit distribution package that's best suited for you and follow the Get Started Guide for post-installation instructions.
Download AI Kit | AI Kit Get Started Guide |
---|---|
Prerequisites
The following prerequisites and requirements must be satisfied for a successful installation:
- Python version: 3.6, 3.7, 3.8, or 3.9
- Download and install Anaconda.
- Create a virtual environment named lpot in Anaconda:

```shell
# Here we install Python 3.7 as an example. You can also choose Python 3.6, 3.8, or 3.9.
conda create -n lpot python=3.7
conda activate lpot
```
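Before creating the environment, you can confirm that an interpreter matches the Python versions listed in the prerequisites. A small sketch; the function name `is_supported_python` is hypothetical and not part of Intel® Neural Compressor.

```python
import sys

# Python versions listed in the prerequisites above.
SUPPORTED_PYTHON = {(3, 6), (3, 7), (3, 8), (3, 9)}


def is_supported_python(version_info=sys.version_info) -> bool:
    """Return True if the major.minor version is one the prerequisites list."""
    return tuple(version_info[:2]) in SUPPORTED_PYTHON


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    status = "supported" if is_supported_python() else "NOT supported"
    print(f"Python {major}.{minor} is {status} by the prerequisites above.")
```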
Installation options
Install from binary:

```shell
# install stable version from pip
pip install lpot
# install nightly version from pip
pip install -i https://test.pypi.org/simple/ lpot
# install from conda
conda install lpot -c conda-forge -c intel
```

Install from source:

```shell
git clone https://github.com/intel/neural-compressor.git
cd neural-compressor
pip install -r requirements.txt
python setup.py install
```
Get Started
- APIs explains Intel® Neural Compressor's API.
- Transform introduces how to utilize Intel® Neural Compressor's built-in data processing and how to develop a custom data processing method.
- Dataset introduces how to utilize Intel® Neural Compressor's built-in dataset and how to develop a custom dataset.
- Metric introduces how to utilize Intel® Neural Compressor's built-in metrics and how to develop a custom metric.
- Tutorial provides comprehensive instructions on how to utilize Intel® Neural Compressor's features with examples.
- Examples are provided to demonstrate the usage of Intel® Neural Compressor in different frameworks: TensorFlow, PyTorch, MXNet, and ONNX Runtime.
- UX is a web-based system used to simplify Intel® Neural Compressor usage.
- Intel oneAPI AI Analytics Toolkit Get Started Guide explains the AI Kit components, installation and configuration guides, and instructions for building and running sample apps.
- AI and Analytics Samples includes code samples for Intel oneAPI libraries.
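The Metric document above explains how to develop a custom metric. As a hedged illustration only, the class below assumes a custom metric exposes `update`, `reset`, and `result` methods (check the Metric document for the exact interface your version expects). It computes top-1 accuracy over batches of plain Python lists and does not import Intel® Neural Compressor; the class name `Top1Accuracy` is hypothetical.

```python
class Top1Accuracy:
    """Sketch of a custom metric that accumulates top-1 accuracy across batches.

    The update/reset/result method names are an assumption about the
    expected interface, not a confirmed Intel Neural Compressor API.
    """

    def __init__(self):
        self.reset()

    def update(self, preds, labels):
        # preds: one list of class scores per sample; labels: integer class ids.
        for scores, label in zip(preds, labels):
            predicted = max(range(len(scores)), key=scores.__getitem__)
            self.correct += int(predicted == label)
            self.total += 1

    def reset(self):
        self.correct = 0
        self.total = 0

    def result(self):
        # Guard against division by zero before any update.
        return self.correct / self.total if self.total else 0.0


metric = Top1Accuracy()
metric.update([[0.1, 0.9], [0.8, 0.2]], [1, 1])  # one hit, one miss
metric.update([[0.3, 0.7]], [1])                 # one hit
print(metric.result())
```

Keeping running counts rather than per-batch averages makes the result independent of how the evaluation data is batched.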
Deep Dive
- Quantization refers to processes that enable inference and training by performing computations in low-precision data types, such as fixed-point integers. Intel® Neural Compressor supports Post-Training Quantization (PTQ) with different quantization capabilities and Quantization-Aware Training (QAT). Note that Dynamic Quantization currently has limited support.
- Pruning provides a common method for introducing sparsity in weights and activations.
- Benchmarking introduces how to utilize the benchmark interface of Intel® Neural Compressor.
- Mixed precision introduces how to enable mixed precision, including BF16, INT8, and FP32, on Intel platforms during tuning.
- Graph Optimization introduces how to enable graph optimization for FP32 and auto-mixed precision.
- Model Conversion introduces how to convert a TensorFlow QAT model to a quantized model running on Intel platforms.
- TensorBoard provides tensor histograms and execution graphs for debugging during tuning.
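To make the quantization bullet concrete: post-training quantization commonly maps FP32 values to 8-bit integers using a scale and zero point derived from an observed value range. The sketch below uses plain Python and asymmetric (affine) min/max quantization to the unsigned range [0, 255]. It illustrates the general technique only, not Intel® Neural Compressor's internal calibration algorithm, and the helper names are hypothetical.

```python
def compute_qparams(values, qmin=0, qmax=255):
    """Derive scale and zero point from the min/max of calibration values."""
    lo = min(min(values), 0.0)  # include 0.0 in the range so it maps exactly
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Round to the integer grid, shift by the zero point, and clamp."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(qvalues, scale, zero_point):
    """Map integers back to approximate real values."""
    return [(q - zero_point) * scale for q in qvalues]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
scale, zp = compute_qparams(weights)
q = quantize(weights, scale, zp)
restored = dequantize(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(scale, zp, q, max_err)
```

The reconstruction error stays on the order of one quantization step (`scale`), which is why accuracy-driven tuning is needed to decide where that error is acceptable.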
Advanced Topics
- Adaptor is the interface between Intel® Neural Compressor and the framework. The method for developing an adaptor extension is introduced using ONNX Runtime as an example.
- Strategy automatically optimizes low-precision recipes for deep learning models to achieve optimal product objectives, such as inference performance and memory usage, while meeting the expected accuracy criteria. The method for developing a new strategy is introduced.
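Conceptually, an accuracy-driven strategy walks through candidate low-precision configurations, evaluates each one, and stops at the first whose relative accuracy loss stays within a tolerance. The sketch below models that loop with a hypothetical `evaluate` callable, made-up candidate names, and a 1% relative-loss criterion. It is a simplified illustration, not the library's actual strategy implementation.

```python
from typing import Callable, Iterable, Optional, Tuple


def tune(
    candidates: Iterable[str],
    evaluate: Callable[[str], float],
    fp32_accuracy: float,
    max_relative_loss: float = 0.01,
) -> Optional[Tuple[str, float]]:
    """Return the first candidate whose accuracy meets the criterion, else None.

    Relative loss is (fp32 - candidate) / fp32; a negative value means the
    candidate is more accurate than the FP32 baseline.
    """
    for config in candidates:
        accuracy = evaluate(config)
        relative_loss = (fp32_accuracy - accuracy) / fp32_accuracy
        if relative_loss <= max_relative_loss:
            return config, accuracy
    return None


# Hypothetical accuracies for three candidate recipes, ordered fastest first.
fake_results = {"all_int8": 0.740, "int8_except_first_last": 0.755, "fp32_fallback": 0.761}
best = tune(fake_results, fake_results.__getitem__, fp32_accuracy=0.761)
print(best)
```

Ordering candidates from most to least aggressive means the first accepted configuration is also the fastest one that meets the accuracy goal.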
Publications
- MLPerf™ Performance Gains Abound with latest 3rd Generation Intel® Xeon® Scalable Processors (Apr 2021)
- 3D Digital Face Reconstruction Solution enabled by 3rd Gen Intel® Xeon® Scalable Processors (Apr 2021)
- Accelerating Alibaba Transformer model performance with 3rd Gen Intel® Xeon® Scalable Processors (Ice Lake) and Intel® Deep Learning Boost (Apr 2021)
- Using Low-Precision Optimizations for High-Performance DL Inference Applications (Apr 2021)
- DL Boost Quantization with CERN's 3D-GANs model (Feb 2021)
For the full list of publications, refer to the publication list.
Intel® Neural Compressor supports systems based on Intel 64 architecture or compatible processors, and is especially optimized for the following CPUs:
- Intel Xeon Scalable processor (formerly Skylake, Cascade Lake, Cooper Lake, and Ice Lake)
- Future Intel Xeon Scalable processor (code name Sapphire Rapids)
Intel® Neural Compressor requires installing the Intel-optimized framework version for the supported DL framework you use: TensorFlow, PyTorch, MXNet, or ONNX Runtime.
Note: Intel Neural Compressor supports Intel-optimized and official frameworks for some TensorFlow versions. Refer to Supported Frameworks for specifics.
Platform | OS | Python | Framework | Version |
---|---|---|---|---|
Cascade Lake, Cooper Lake, Skylake, Ice Lake | CentOS 8.3, Ubuntu 18.04 | 3.6, 3.7, 3.8, 3.9 | TensorFlow | 2.6.0, 2.5.0, 2.4.0, 2.3.0, 2.2.0, 2.1.0, 1.15.0 UP1, 1.15.0 UP2, 1.15.0 UP3, 1.15.2 |
| | | | PyTorch | 1.5.0+cpu, 1.6.0+cpu, 1.8.0+cpu, IPEX |
| | | | MXNet | 1.8.0, 1.7.0, 1.6.0 |
| | | | ONNX Runtime | 1.6.0, 1.7.0, 1.8.0 |
Intel® Neural Compressor provides numerous examples that show small accuracy loss alongside strong performance gains. A full quantized model list on various frameworks is available in the Model List.
Model | Framework | Support | Example |
---|---|---|---|
ResNet50 v1.5 | TensorFlow | Yes | Link |
ResNet50 v1.5 | PyTorch | Yes | Link |
DLRM | PyTorch | Yes | Link |
BERT-large | TensorFlow | Yes | Link |
BERT-large | PyTorch | Yes | Link |
SSD-ResNet34 | TensorFlow | WIP | |
SSD-ResNet34 | PyTorch | Yes | Link |
RNN-T | PyTorch | WIP | |
3D-UNet | TensorFlow | WIP | |
3D-UNet | PyTorch | Yes | Link |
Framework | Version | Model | INT8 Tuning Accuracy | FP32 Accuracy Baseline | Acc Ratio [(INT8-FP32)/FP32] | INT8 Realtime (ms), CLX8280 1s 4c per instance | FP32 Realtime (ms), CLX8280 1s 4c per instance | Realtime Latency Ratio [FP32/INT8] |
---|---|---|---|---|---|---|---|---|
tensorflow | 2.5.0 | resnet50v1.0 | 74.24% | 74.27% | -0.04% | 7.64 | 21.54 | 2.82x |
tensorflow | 2.5.0 | resnet50v1.5 | 76.94% | 76.46% | 0.63% | 9.54 | 24.28 | 2.54x |
tensorflow | 2.5.0 | resnet101 | 77.21% | 76.45% | 0.99% | 12.92 | 30.65 | 2.37x |
tensorflow | 2.5.0 | inception_v1 | 70.30% | 69.74% | 0.80% | 5.58 | 10.13 | 1.82x |
tensorflow | 2.5.0 | inception_v2 | 74.27% | 73.97% | 0.41% | 6.78 | 12.42 | 1.83x |
tensorflow | 2.5.0 | inception_v3 | 77.29% | 76.75% | 0.70% | 12.90 | 27.74 | 2.15x |
tensorflow | 2.5.0 | inception_v4 | 80.36% | 80.27% | 0.11% | 21.00 | 54.42 | 2.59x |
tensorflow | 2.5.0 | inception_resnet_v2 | 80.42% | 80.40% | 0.02% | 44.72 | 87.62 | 1.96x |
tensorflow | 2.5.0 | mobilenetv1 | 73.93% | 70.96% | 4.19% | 2.96 | 9.88 | 3.34x |
tensorflow | 2.5.0 | mobilenetv2 | 71.96% | 71.76% | 0.28% | 4.95 | 10.71 | 2.16x |
tensorflow | 2.5.0 | ssd_resnet50_v1 | 37.91% | 38.00% | -0.24% | 145.96 | 422.11 | 2.89x |
tensorflow | 2.5.0 | ssd_mobilenet_v1 | 23.02% | 23.13% | -0.48% | 12.19 | 26.85 | 2.20x |
Framework | Version | Model | INT8 Tuning Accuracy | FP32 Accuracy Baseline | Acc Ratio [(INT8-FP32)/FP32] | INT8 Realtime (ms), CLX8280 1s 4c per instance | FP32 Realtime (ms), CLX8280 1s 4c per instance | Realtime Latency Ratio [FP32/INT8] |
---|---|---|---|---|---|---|---|---|
pytorch | 1.9.0+cpu | resnet18 | 69.58% | 69.76% | -0.26% | 13.59 | 24.97 | 1.84x |
pytorch | 1.9.0+cpu | resnet50 | 75.87% | 76.13% | -0.34% | 25.67 | 54.12 | 2.11x |
pytorch | 1.9.0+cpu | resnext101_32x8d | 79.09% | 79.31% | -0.28% | 62.44 | 147.88 | 2.37x |
pytorch | 1.9.0+cpu | bert_base_mrpc | 88.16% | 88.73% | -0.64% | 41.33 | 81.93 | 1.98x |
pytorch | 1.9.0+cpu | bert_base_cola | 58.29% | 58.84% | -0.93% | 39.30 | 86.58 | 2.20x |
pytorch | 1.9.0+cpu | bert_base_sts-b | 88.65% | 89.27% | -0.70% | 39.46 | 86.97 | 2.20x |
pytorch | 1.9.0+cpu | bert_base_sst-2 | 91.63% | 91.86% | -0.25% | 39.12 | 82.59 | 2.11x |
pytorch | 1.9.0+cpu | bert_base_rte | 69.31% | 69.68% | -0.52% | 39.81 | 81.98 | 2.06x |
pytorch | 1.9.0+cpu | bert_large_mrpc | 87.48% | 88.33% | -0.95% | 112.61 | 287.44 | 2.55x |
pytorch | 1.9.0+cpu | bert_large_squad | 92.79 | 93.05 | -0.28% | 497.79 | 953.74 | 1.92x |
pytorch | 1.9.0+cpu | bert_large_qnli | 91.12% | 91.82% | -0.76% | 112.43 | 291.10 | 2.59x |
pytorch | 1.9.0+cpu | bert_large_rte | 72.92% | 72.56% | 0.50% | 148.60 | 287.03 | 1.93x |
pytorch | 1.9.0+cpu | bert_large_cola | 62.85% | 62.57% | 0.45% | 112.54 | 283.38 | 2.52x |
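The two ratio columns in the tables above follow the formulas in their headers: Acc Ratio = (INT8 - FP32) / FP32 and Realtime Latency Ratio = FP32 latency / INT8 latency. The sketch below recomputes both for the TensorFlow resnet50v1.0 row and the PyTorch resnet50 row, using only numbers copied from the tables; the function names are illustrative.

```python
def accuracy_ratio(int8_acc: float, fp32_acc: float) -> float:
    """Relative accuracy change in percent: (INT8 - FP32) / FP32 * 100."""
    return (int8_acc - fp32_acc) / fp32_acc * 100


def latency_ratio(fp32_ms: float, int8_ms: float) -> float:
    """Speedup factor: FP32 latency divided by INT8 latency."""
    return fp32_ms / int8_ms


# (model, INT8 acc %, FP32 acc %, INT8 ms, FP32 ms) copied from the tables above.
rows = [
    ("tensorflow resnet50v1.0", 74.24, 74.27, 7.64, 21.54),
    ("pytorch resnet50", 75.87, 76.13, 25.67, 54.12),
]
for name, int8_acc, fp32_acc, int8_ms, fp32_ms in rows:
    print(f"{name}: {accuracy_ratio(int8_acc, fp32_acc):.2f}% "
          f"{latency_ratio(fp32_ms, int8_ms):.2f}x")
```

Rounded to two decimals, both rows reproduce the published -0.04% / 2.82x and -0.34% / 2.11x values.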
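Tasks | FWK | Model | fp32 baseline | gradient sensitivity with 20% sparsity: accuracy% | gradient sensitivity with 20% sparsity: drop% | gradient sensitivity with 20% sparsity: perf gain (sample/s) | +ONNX dynamic quantization on pruned model: accuracy% | +ONNX dynamic quantization on pruned model: drop% | +ONNX dynamic quantization on pruned model: perf gain (sample/s) |
---|---|---|---|---|---|---|---|---|---|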
SST-2 | pytorch | bert-base | accuracy = 92.32 | accuracy = 91.97 | -0.38 | 1.30x | accuracy = 92.20 | -0.13 | 1.86x |
QQP | pytorch | bert-base | [accuracy, f1] = [91.10, 88.05] | [accuracy, f1] = [89.97, 86.54] | [-1.24, -1.71] | 1.32x | [accuracy, f1] = [89.75, 86.60] | [-1.48, -1.65] | 1.81x |
Tasks | FWK | Model | fp32 baseline | Pattern Lock on 70% Unstructured Sparsity: accuracy% | Pattern Lock on 70% Unstructured Sparsity: drop% | Pattern Lock on 50% 1:2 Structured Sparsity: accuracy% | Pattern Lock on 50% 1:2 Structured Sparsity: drop% |
---|---|---|---|---|---|---|---|
MNLI | pytorch | bert-base | [m, mm] = [84.57, 84.79] | [m, mm] = [82.45, 83.27] | [-2.51, -1.80] | [m, mm] = [83.20, 84.11] | [-1.62, -0.80] |
SST-2 | pytorch | bert-base | accuracy = 92.32 | accuracy = 91.51 | -0.88 | accuracy = 92.20 | -0.13 |
QQP | pytorch | bert-base | [accuracy, f1] = [91.10, 88.05] | [accuracy, f1] = [90.48, 87.06] | [-0.68, -1.12] | [accuracy, f1] = [90.92, 87.78] | [-0.20, -0.31] |
QNLI | pytorch | bert-base | accuracy = 91.54 | accuracy = 90.39 | -1.26 | accuracy = 90.87 | -0.73 |
QnA | pytorch | bert-base | [em, f1] = [79.34, 87.10] | [em, f1] = [77.27, 85.75] | [-2.61, -1.54] | [em, f1] = [78.03, 86.50] | [-1.65, -0.69] |
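The table above compares 70% unstructured sparsity with 50% 1:2 structured sparsity. In a 1:2 pattern, every group of two consecutive weights keeps at most one nonzero value. Pattern lock keeps the zero positions of an already sparse model fixed during fine-tuning. The sketch below builds a 1:2 mask by keeping the larger-magnitude weight in each pair, then re-applies a locked mask after a simulated update. It is a plain-Python illustration of the idea with hypothetical helper names, not the library's implementation.

```python
def one_of_two_mask(weights):
    """1:2 structured mask: in each consecutive pair keep the larger |w|."""
    if len(weights) % 2:
        raise ValueError("length must be a multiple of 2 for a 1:2 pattern")
    mask = []
    for i in range(0, len(weights), 2):
        keep_first = abs(weights[i]) >= abs(weights[i + 1])
        mask += [1, 0] if keep_first else [0, 1]
    return mask


def lock_pattern(sparse_weights):
    """Pattern lock: record which positions are currently nonzero."""
    return [1 if w != 0 else 0 for w in sparse_weights]


def apply_mask(weights, mask):
    """Zero every position where the mask is 0."""
    return [w * m for w, m in zip(weights, mask)]


dense = [0.4, -0.9, 0.05, 0.3, -0.7, 0.2]
sparse = apply_mask(dense, one_of_two_mask(dense))  # 50% sparsity, 1:2 pattern
locked = lock_pattern(sparse)

# Simulated fine-tuning step that would otherwise revive pruned weights.
updated = [w + 0.01 for w in sparse]
fine_tuned = apply_mask(updated, locked)
print(sparse, fine_tuned)
```

Re-applying the locked mask after every update is what keeps the sparsity pattern (and therefore any hardware-friendly structure) intact while the remaining weights adapt.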
Framework | Model | fp32 baseline | Compression | dataset | acc(drop)% |
---|---|---|---|---|---|
Pytorch | resnet18 | 69.76 | 30% sparsity on magnitude | ImageNet | 69.47(-0.42) |
Pytorch | resnet18 | 69.76 | 30% sparsity on gradient sensitivity | ImageNet | 68.85(-1.30) |
Pytorch | resnet50 | 76.13 | 30% sparsity on magnitude | ImageNet | 76.11(-0.03) |
Pytorch | resnet50 | 76.13 | 30% sparsity on magnitude and post training quantization | ImageNet | 76.01(-0.16) |
Pytorch | resnet50 | 76.13 | 30% sparsity on magnitude and quantization aware training | ImageNet | 75.90(-0.30) |
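The table above reports 30% sparsity obtained by magnitude pruning. Unstructured magnitude pruning zeroes the weights with the smallest absolute values until the target sparsity is reached. A minimal plain-Python sketch of that selection step, assuming a single one-shot pass; practical pruning flows typically interleave this with training and fine-tuning, which the sketch omits. The function name is hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero the `sparsity` fraction of weights with the smallest |w| (one shot)."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    num_to_prune = round(len(weights) * sparsity)
    # Indices sorted by magnitude, smallest first; ties keep original order.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:num_to_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


weights = [0.8, -0.05, 0.3, -0.6, 0.02, 0.9, -0.15, 0.4, 0.07, -0.5]
pruned = magnitude_prune(weights, sparsity=0.30)
print(pruned)
print(sum(w == 0.0 for w in pruned) / len(pruned))
```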