This is a tutorial on how to optimize huggingface BERT models using the custom operator. This example uses an approach that does not make use of the model_modifier module and works using the same approach as the PyTorch optimization.
- the operator built with -DBACKENDS=TF
- Python dependencies, see here
- PyTorch (optional, for the accuracy demo)
- Make sure the path to the compiled operator .so is exported in the BERT_OP_LIB environment variable:
export BERT_OP_LIB=/<path-to-build-dir>/src/tf_op/libBertOp.so
- Add the python subdirectory of this project to your PYTHONPATH:
export PYTHONPATH=$PYTHONPATH:<repo-root>/python
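A quick sanity check of the setup above is to try importing the package; if either the library path or the PYTHONPATH entry is missing, this import is expected to fail:
python -c "import bert_op"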
NOTE: This sample, by default, requires PyTorch to be installed, which allows transformers to load the weights of a torch model into a Keras model. Alternatively, an MRPC-fine-tuned Keras model can be used, in which case PyTorch will not be required.
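For illustration, this is roughly what the note above means in transformers terms: the from_pt flag tells transformers to convert a PyTorch checkpoint into Keras weights. This is a minimal sketch; the Intel/bert-base-uncased-mrpc checkpoint is the one used by the accuracy script below, and the alternative path is a hypothetical placeholder:
import transformers
# Loading PyTorch weights into a Keras (TF) model requires PyTorch to be installed:
model = transformers.TFBertModel.from_pretrained('Intel/bert-base-uncased-mrpc', from_pt=True)
# An MRPC-fine-tuned Keras checkpoint would load without PyTorch:
# model = transformers.TFBertModel.from_pretrained('<path-to-mrpc-fine-tuned-keras-model>')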
Navigate to the tests/tf2_no_model_modifier subdirectory of the project and run the accuracy script:
cd <repo-root>/tests/tf2_no_model_modifier
python accuracy.py -p ../../jenkins_resources/tf2/quant_factors_uncased_L-12_H-768_A-12.txt
This will execute a default MRPC accuracy check with the following configuration:
- Intel/bert-base-uncased-mrpc model
- bert-base-uncased tokenizer
- 100 samples
- pre-computed quantization factors for BERT-base
You can run python accuracy.py -h to view available options.
The accuracy script will first execute the huggingface model as-is. Then, the model will be optimized with the BERT operator and the same test samples will be fed to it in the following modes:
- pure FP32
- FP32 + QINT8
- BF16
- BF16 + QINT8
A summary of the accuracy scores will then be printed to the console.
Navigate to the tests/tf2_no_model_modifier subdirectory of the project and run the benchmark script with the desired configuration, for example:
cd <repo-root>/tests/tf2_no_model_modifier
python benchmark.py -m bert-large-uncased --bert-op --quantization -p <path-to-quantization-factors-file> --batch-size=4 --seq-len=128 --run-time=60
This will load the bert-large-uncased model, optimize it with the BERT operator, then execute it in QINT8 mode with a batch size of 4 and a sequence length of 128. The benchmark will first run a number of warmup cycles (defaulting to 10% of the measured run time, so 6 seconds in this case), then measure the average latency and throughput over 60 seconds.
Run python benchmark.py -h for a full list of options.
For comparison, you can then run the same benchmark without the --bert-op flag to execute the unoptimized model.
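For example, the baseline counterpart of the command above (dropping the operator and quantization options) would be:
python benchmark.py -m bert-large-uncased --batch-size=4 --seq-len=128 --run-time=60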
The benchmark.py script is also used in the docker-based sample found here. This is the easiest way to see the BERT operator in action. Refer to the README for details.
Using the BERT operator in your model is very easy. Currently, all huggingface BERT models should benefit from this optimization, i.e. all models that use the transformers.models.bert.modeling_bert.BertEncoder class.
In order to start using the optimized BERT op, just import the bert_op package in your code before you load the model. Assuming you have the environment set up and the operator compiled (see Requirements), adding the import bert_op line should be all you need, for example:
import transformers
... # your code
import bert_op # Important, do this at any point BEFORE the call to `transformers.from_pretrained`
model = transformers.BertModel.from_pretrained('bert-base-uncased')
output = model(**inputs) # model now executes the BERT operator.
That's it! Your model is now using the BERT operator.
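The inputs dictionary in the snippet above is not shown; a minimal way to build it, assuming the bert-base-uncased tokenizer used elsewhere in this tutorial, could look like this:
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')
inputs = tokenizer("An example sentence to run through the model.", return_tensors='pt')  # use return_tensors='tf' with TFBertModel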
- The operator only supports inference workloads, and currently only works on BERT models. (The TensorFlow operator has also been tested on RoBERTa models; RoBERTa support will likely be added to the PyTorch operator as well.)
- The optimization is injected into the model via class substitution:
transformers.models.bert.modeling_bert.BertEncoder = BertEncoderOp
This means that any model which uses transformers.models.bert.modeling_bert.BertEncoder will use bert_op.BertEncoderOp instead, provided it is created after import bert_op. Models created before the import are unaffected.
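For example, only models constructed after the import pick up the substitution:
import transformers

model_plain = transformers.BertModel.from_pretrained('bert-base-uncased')  # created before the import: stock BertEncoder

import bert_op  # replaces transformers.models.bert.modeling_bert.BertEncoder with BertEncoderOp

model_op = transformers.BertModel.from_pretrained('bert-base-uncased')  # created after the import: uses the BERT operator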
The BERT operator can utilize Int8 quantization and BFloat16 computations on supported hardware. To enable these features, add appropriate fields to the BertConfig before loading the model:
import transformers
... # your code
import bert_op # Important, do this at any point BEFORE the call to `transformers.from_pretrained`
config = transformers.BertConfig.from_pretrained('bert-base-uncased')
config.use_bfloat16 = True # or False, defaults to False if not provided
config.use_quantization = True # or False, defaults to False if not provided
config.quant_factors_path = 'path/to/quant/factors/file' # necessary if config.use_quantization == True
model = transformers.TFBertModel.from_pretrained('bert-base-uncased', config=config)
output = model(**inputs)
When using the TensorFlow backend, the BERT operator loads quantization factors from an external file. In order to generate this file (i.e. calibrate the quantization factors), prepare the BertConfig for FP32 mode, set the calibration flag and provide an output path for the file before loading the model:
import transformers
... # your code
import bert_op # Important, do this at any point BEFORE the call to `transformers.from_pretrained`
config = transformers.BertConfig.from_pretrained('bert-base-uncased')
config.use_bfloat16 = False # These two can be omitted,
config.use_quantization = False # they are here just for clarity
config.calibrate_quant_factors = True # This enables calibration mode
config.quant_factors_path = 'path/to/quant/factors/file' # Quantization factors will be put here
model = transformers.TFBertModel.from_pretrained('bert-base-uncased', config=config)
output = model(**inputs) # model now executes in FP32 mode, and `TFBertEncoderOp` calibrates the quantization factors
Execute the model on a workload, and the operator will generate quantization factors for itself and save them to the provided path.
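As a sketch, such a calibration workload can be as simple as running a handful of representative samples through the FP32 model; the tokenizer and the sentences below are placeholders for your own data:
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')
calibration_sentences = [
    "The company said quarterly profits rose sharply.",
    "Quarterly profits at the company increased significantly.",
]
for sentence in calibration_sentences:
    inputs = tokenizer(sentence, return_tensors='tf')
    model(**inputs)  # each FP32 forward pass lets the operator observe the value ranges it needs
# The resulting factors are saved to the path given in config.quant_factors_path.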
Upon the next execution, you can set config.calibrate_quant_factors = False, set config.use_quantization = True, and set config.quant_factors_path to the same path that was used in calibration mode. The operator will load the quantization factors and execute in QINT8 mode.
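Put together, a subsequent QINT8 run could look like this:
import transformers
import bert_op  # as before, import before loading the model

config = transformers.BertConfig.from_pretrained('bert-base-uncased')
config.calibrate_quant_factors = False
config.use_quantization = True
config.quant_factors_path = 'path/to/quant/factors/file'  # same path that was used in calibration mode
model = transformers.TFBertModel.from_pretrained('bert-base-uncased', config=config)
output = model(**inputs)  # model now executes in QINT8 mode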