Remove dynamic quantization option for PyTorch models at upload #594
Dynamic quantisation is controlled by the --quantize flag of the eland_import_hub_model CLI.
To understand exactly what happens when quantising on a different architecture to the one used at evaluation, I traced the model with the --quantize option; it failed with the error below.
Full stack trace:

Traceback (most recent call last):
  File "/usr/local/bin/eland_import_hub_model", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/dist-packages/eland/cli/eland_import_hub_model.py", line 235, in main
    tm = TransformerModel(
  File "/usr/local/lib/python3.9/dist-packages/eland/ml/pytorch/transformers.py", line 630, in __init__
    self._traceable_model.quantize()
  File "/usr/local/lib/python3.9/dist-packages/eland/ml/pytorch/traceable_model.py", line 43, in quantize
    torch.quantization.quantize_dynamic(
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/quantization/quantize.py", line 450, in quantize_dynamic
    convert(model, mapping, inplace=True)
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/quantization/quantize.py", line 535, in convert
    _convert(
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/quantization/quantize.py", line 573, in _convert
    _convert(mod, mapping, True,  # inplace
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/quantization/quantize.py", line 573, in _convert
    _convert(mod, mapping, True,  # inplace
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/quantization/quantize.py", line 573, in _convert
    _convert(mod, mapping, True,  # inplace
  [Previous line repeated 3 more times]
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/quantization/quantize.py", line 575, in _convert
    reassign[name] = swap_module(mod, mapping, custom_module_class_mapping)
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/quantization/quantize.py", line 608, in swap_module
    new_mod = qmod.from_float(mod)
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/nn/quantized/dynamic/modules/linear.py", line 111, in from_float
    qlinear = cls(mod.in_features, mod.out_features, dtype=dtype)
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/nn/quantized/dynamic/modules/linear.py", line 35, in __init__
    super(Linear, self).__init__(in_features, out_features, bias_, dtype=dtype)
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/nn/quantized/modules/linear.py", line 150, in __init__
    self._packed_params = LinearPackedParams(dtype)
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/nn/quantized/modules/linear.py", line 27, in __init__
    self.set_weight_bias(wq, None)
  File "/usr/local/lib/python3.9/dist-packages/torch/ao/nn/quantized/modules/linear.py", line 32, in set_weight_bias
    self._packed_params = torch.ops.quantized.linear_prepack(weight, bias)
  File "/usr/local/lib/python3.9/dist-packages/torch/_ops.py", line 442, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: Didn't find engine for operation quantized::linear_prepack NoQEngine
The 8.9 Docker image with version 1.13.1 of PyTorch was used in this test.
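For context, the failing call is the torch.quantization.quantize_dynamic step that eland runs when quantisation is requested. Here is a minimal sketch that exercises the same operation on a toy model (not eland's code); on a PyTorch build with no quantized engine for the host CPU it fails with the same linear_prepack / NoQEngine error:

```python
# Minimal sketch of the dynamic-quantization step, using a toy model
# rather than the real transformer. On a PyTorch build whose only
# supported quantized engine is 'none', packing the Linear weights raises:
#   RuntimeError: Didn't find engine for operation quantized::linear_prepack NoQEngine
import torch
from torch import nn

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 2)).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)
```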
Dynamic quantization of PyTorch models has proven to be a challenge for two reasons.
(1) Dynamic quantization ties the traced TorchScript model to a particular CPU architecture and makes it non-portable. For example, tracing the model with the upload CLI on an ARM-based M-series Apple processor makes it non-portable to an Intel CPU, and vice versa. Tracing on ARM also means that Intel-specific optimisations cannot be used. Best practice is to trace the model on the same CPU architecture as the target inference processors (see the sketch after point 2). GPU support would add further complexity, and eland is currently not capable of tracing with a GPU at all.
(2) "Blind" dynamic quantization at upload time could also be considered as an anti-pattern/not a best practice. Quantization can often damage the accuracy of a model and doing quantization blindly, without evaluating the model afterwards, can produce surprising results at inference.
For these reasons, we believe it is safest to remove dynamic quantization as an option. If users would like to use quantized models, they can do so in PyTorch or transformers directly, and upload their new model with eland's Python methods (as opposed to using the CLI).
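For reference, the Python upload path looks roughly like the example in the eland documentation. Everything below is a sketch: the Hugging Face model id and cluster URL are placeholders, and the exact keyword arguments may differ between eland versions.

```python
from pathlib import Path

from elasticsearch import Elasticsearch
from eland.ml.pytorch import PyTorchModel
from eland.ml.pytorch.transformers import TransformerModel

# Placeholder Hugging Face model id and cluster URL -- substitute your own.
tm = TransformerModel(
    model_id="distilbert-base-uncased-finetuned-sst-2-english",
    task_type="text_classification",
)

# Trace the model and write the TorchScript artifact, config and vocab to disk.
tmp_dir = Path("tmp-models")
tmp_dir.mkdir(exist_ok=True)
model_path, config, vocab_path = tm.save(str(tmp_dir))

# Upload the traced model to the cluster.
es = Elasticsearch("http://localhost:9200")
ptm = PyTorchModel(es, tm.elasticsearch_model_id())
ptm.import_model(
    model_path=model_path,
    config_path=None,
    vocab_path=vocab_path,
    config=config,
)
```

Quantization, when wanted, would be applied and evaluated by the user before this step rather than toggled on blindly at upload time.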