Intel® Optimizations for TensorFlow 2.13.0
This release of Intel® Optimization for TensorFlow* is based on the TensorFlow v2.13.0 tag and is built with support for oneDNN (oneAPI Deep Neural Network Library). For the features and fixes introduced upstream in TensorFlow 2.13.0, see the TensorFlow 2.13 release notes.
These release notes cover optimizations made both in Intel® Optimization for TensorFlow* and in official TensorFlow v2.13.0, which has oneDNN optimizations enabled by default in Linux x86 packages on CPUs with neural-network-focused hardware features such as AVX512_VNNI, AVX512_BF16, and AMX, found on Intel® Cascade Lake and newer CPUs.
Breaking changes:
oneDNN ops that rely on the blocked memory format are no longer supported and have been changed to return an error, in preparation for removing them from TensorFlow entirely in the next release. Users are encouraged to use the corresponding Eigen ops instead. The following ops are no longer supported:
- Element-wise ops such as MklAddN, MklAdd, MklAddV2, MklSub, MklMul, MklMaximum, MklSquaredDifference.
- MklIdentity
- MklInputConversion
- MklLRN and MklLRNGrad
- MklReshape
- MklSlice
- MklToTf
Major features:
- See TensorFlow 2.13.0 release notes
- Enabled a reduced-precision floating-point math mode via the new environment variable TF_SET_ONEDNN_FPMATH_MODE. Setting it to "BF16" allows down-conversions from FP32 to BF16 to speed up computations without a noticeable impact on accuracy (see the example after this list).
- Enabled ITT tagging of oneDNN primitives by default for Intel® VTune™ Profiler. This helps users identify platform bottlenecks in VTune with primitive-level detail such as L1/L2 cache misses or the degree of FP vectorization.
- Supported platforms: Linux
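The snippet below is a minimal sketch of using the new math-mode variable. It assumes the variable is read when oneDNN is initialized, so it is set before TensorFlow is imported; the matmul shapes are illustrative only.

```python
import os

# Assumption: set the variable before importing TensorFlow so that oneDNN
# picks it up at initialization time.
os.environ["TF_SET_ONEDNN_FPMATH_MODE"] = "BF16"

import tensorflow as tf

# Illustrative workload: eligible FP32 compute may be down-converted to BF16
# internally while inputs and outputs remain FP32.
x = tf.random.normal([32, 1024])
w = tf.random.normal([1024, 1024])
print(tf.matmul(x, w).dtype)  # still float32 at the TensorFlow level
```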
Improvements:
- Parallelized the UnsortedSegment op with a simpler workload-balancing algorithm. This resulted in a ~1.92x to 14.46x speedup in microbenchmarks and ~5% higher throughput in public recommendation models on CPU.
- Added support for setting the maximum number of threads available to oneDNN at primitive creation time for the Eigen threadpool backend. This resulted in up to ~1.5x speedup in convolution microbenchmarks and higher CPU utilization on hyperthreading-enabled systems.
- Enabled an optimized FP32 implementation using the new Eigen LeakyRelu functor, leading to a ~55% average performance gain as measured by kernel microbenchmarks.
- Added BF16 support for the FusedPadConv2D op, which occurs frequently in the CycleGAN model. This resulted in a ~10% performance improvement in CycleGAN with AMP (auto-mixed precision) enabled.
- Added support for fusing element-wise ops such as LeakyRelu and Mish activations with Fused-Conv2D/3D in the remapper. This resulted in a ~8% average speedup for models containing such patterns.
- Changed the default initialization of the inter-op parallelism thread count: when the caller passes a negative value, it is now reset to 1. This fixed performance degradations seen with weight sharing (a threading-configuration sketch follows this list).
- Added support for oneDNN v3.1 in the following ops: convolution (fwd + bwd), matmul, einsum, transpose, softmax, layernorm, concat, element-wise ops, pooling, quantize, dequantize, quantized-concat, requantization-range-per-channel, and requantize-per-channel. oneDNN v3.1 can be conditionally compiled in by passing the "--config=mkl --define=build_with_onednn_v3=true --define=build_with_onednn_v2=false" flags when building TensorFlow.
- Added support for weight caching in convolution for oneDNN v3.1.
- Added support for Mul + BatchMatMulV2 + Add fusion for FP32 and BF16 in the oneDNN fused-batch-matmul op, since this pattern occurs in DistilBert (see the pattern sketch after this list).
- Added kernel support for Instance Normalization for FP32 and BF16. This includes a graph pass that fuses the component (breakdown) ops, along with an optional Relu/LeakyRelu, into a single Instance Normalization op (an illustrative decomposition follows this list).
- Added support for Quantized Maxpool3D op.
- Added support for FusedBatchNormEx fusion to TFG MLIR grappler.
- Added support for AsString + StringToHashBucketFast fusion in the TFG MLIR grappler (an example of this pattern follows this list).
- Cleaned up the CUDA/oneDNN warnings produced by TensorFlow when running on a machine without a GPU, so that the warnings emitted reflect the machine on which TensorFlow is actually running.
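As a companion to the inter-op threading change above, the sketch below shows how a caller can set the thread-pool sizes explicitly instead of relying on the defaults. The values used are illustrative assumptions and should be tuned per workload and machine.

```python
import tensorflow as tf

# Illustrative values only; thread counts should be tuned for the workload.
# These calls must run before any op executes.
tf.config.threading.set_inter_op_parallelism_threads(2)
tf.config.threading.set_intra_op_parallelism_threads(8)

print(tf.config.threading.get_inter_op_parallelism_threads())
print(tf.config.threading.get_intra_op_parallelism_threads())
```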
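The Mul + BatchMatMulV2 + Add fusion mentioned above targets attention-style score computations. The sketch below writes one plausible form of that pattern with ordinary TensorFlow ops; the shapes, names, and placement of the multiply are illustrative assumptions, not taken from DistilBert itself.

```python
import tensorflow as tf

@tf.function
def attention_scores(query, key, mask, scale):
    # Mul (scale) -> BatchMatMulV2 (matmul over batched 4-D tensors) -> Add (mask).
    return tf.matmul(query * scale, key, transpose_b=True) + mask

q = tf.random.normal([8, 12, 128, 64])
k = tf.random.normal([8, 12, 128, 64])
mask = tf.zeros([8, 12, 128, 128])
print(attention_scores(q, k, mask, 0.125).shape)
```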
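For the Instance Normalization fusion above, the sketch below shows the kind of breakdown-op pattern, with an optional Relu, that such a graph pass can collapse into a single op. The exact set of ops matched by the pass is not spelled out here, so treat this particular decomposition as an assumption.

```python
import tensorflow as tf

def instance_norm_relu(x, gamma, beta, eps=1e-6):
    # Per-instance, per-channel statistics over the spatial dims (NHWC assumed).
    mean, var = tf.nn.moments(x, axes=[1, 2], keepdims=True)
    y = (x - mean) * tf.math.rsqrt(var + eps) * gamma + beta
    return tf.nn.relu(y)  # optional activation that can be folded into the fused op

x = tf.random.normal([2, 16, 16, 8])
gamma = tf.ones([1, 1, 1, 8])
beta = tf.zeros([1, 1, 1, 8])
print(instance_norm_relu(x, gamma, beta).shape)
```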
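The AsString + StringToHashBucketFast fusion above corresponds to a common feature-hashing idiom for integer categorical IDs; a minimal sketch follows, with the IDs and bucket count as illustrative values.

```python
import tensorflow as tf

@tf.function
def hash_ids(ids, num_buckets=1000):
    as_str = tf.strings.as_string(ids)                          # AsString
    return tf.strings.to_hash_bucket_fast(as_str, num_buckets)  # StringToHashBucketFast

print(hash_ids(tf.constant([7, 42, 1024])))
```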
Bug fixes:
- Resolved issues in TensorFlow 2.13.0
- Resolved issues in oneDNN v2.7 and oneDNN v3.1.
- Fixed all issues found during static scan analyses.
- Updated the function signature of oneDNN FusedConv2D to align with generic FusedConv2D. This was done to remove a workaround which was previously applied to fix a crash in oneDNN FusedConv2D.
- Added unit tests for bfloat16 AddN and Round op.
- Fixed potential NPE (null-pointer exception) in quantized ops by adding index-validity checks for min/max tensors.
- Fixed a bug in framework::function_test by adjusting the relative tolerance of the unit test.
- Fixed potential accuracy issues by moving Mean op back to AMP (auto-mixed precision) deny list.
- Fixed a failure in the mkl_fused_batch_norm_op test, caused by the use of a different GEMM API, by adding a relative error tolerance.
- Added error checking to oneDNN AvgPoolGrad kernel to avoid out-of-bounds output memory access.
- Fixed a crash in Mul + Maximum + LeakyRelu fusion for BF16 by fusing Mul + Maximum in the first remapper pass to avoid Cast -> Const conversions for LeakyRelu’s alpha parameter.
- Fixed a performance issue where large matmuls were running on a single thread by storing the dimension sizes of input and output matrices in int64_t instead of int.
- Reverted logging errors added for executor failures since it resulted in non-negligible performance drop when running some models.
Versions and components:
- Intel® optimized TensorFlow based on TensorFlow v2.13.0: r2.13.0_intel_release
- TensorFlow v2.13.0: v2.13.0
- oneDNN v2.7.3: oneDNN v2.7.3
- oneDNN v3.1: oneDNN v3.1
- Model Zoo for Intel® Architecture: Model Zoo
Known issues:
- bfloat16 is not guaranteed to work on AVX or AVX2 systems.