Intel® Optimizations for TensorFlow 2.12.0

Released by @justkw on 29 Mar 2023 · commit 0fce3d3

This release of Intel® Optimized TensorFlow is based on the TensorFlow v2.12.0 tag and is built with support for oneDNN (oneAPI Deep Neural Network Library). For the features and fixes introduced in TensorFlow 2.12.0, please see the TensorFlow 2.12 release notes.
These release notes cover both Intel® Optimizations for TensorFlow* and official TensorFlow v2.12, which has oneDNN optimizations enabled by default in Linux x86 packages on CPUs with neural-network-focused hardware features such as AVX512_VNNI, AVX512_BF16, and AMX, found on Intel Cascade Lake and newer CPUs.
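
On Linux x86 packages, the oneDNN optimizations can also be toggled explicitly through the TF_ENABLE_ONEDNN_OPTS environment variable (see also the Windows note under Known issues). A minimal sketch; the variable must be set before TensorFlow is imported:

```python
import os

# oneDNN code paths are on by default in Linux x86 builds of TF 2.12;
# set the variable to "0" to fall back to the default Eigen kernels.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

import tensorflow as tf  # imported after the environment variable is set

print(tf.__version__)
```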

Major features:

  • See TensorFlow 2.12.0 release notes
  • Added support for Softmax forward for float32 and bfloat16, resulting in a ~2-3x speedup on microbenchmarks measured on Intel Xeon CPUs.
  • Additional performance improvements for bfloat16 models using AMX. New operations were added with bfloat16 support to improve performance and reduce the number of “Cast” operations (see the bfloat16 sketch after this list).
  • Improved the CPU memory allocator and the Eigen threadpool’s task-scheduling algorithm.
  • Supported platforms: Linux
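
As a rough illustration of the bfloat16 path above, the Keras mixed-precision API runs compute in bfloat16 while keeping variables in float32, which lets oneDNN select bfloat16 (AMX) kernels on supported CPUs. A minimal sketch; the toy model and shapes are hypothetical:

```python
import tensorflow as tf

# Compute in bfloat16, variables in float32; on AMX-capable CPUs the
# oneDNN bfloat16 kernels handle the heavy ops.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

# Hypothetical toy model, for illustration only.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(10),
])
print(model.layers[0].compute_dtype)  # bfloat16
print(model.layers[0].dtype)          # float32 (variable dtype)
```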

Improvements:

  • Updated oneDNN version to v2.7.3
  • Added support for Softmax forward for float32 and bfloat16 types. This resulted in a ~3x speedup for float32 microbenchmarks and a ~2.8x speedup for bfloat16 microbenchmarks, as measured on Intel Xeon CPUs with the Eigen threadpool. It also improved inference performance by 12% on some models that use Softmax.
  • Updated the bfloat16 auto-mixed-precision list by adding the “Sum” and “Square” ops to the Infer list. This reduced the number of “Cast” operations around such ops and improved performance by 2x for some models.
  • Added support for fusing the Gelu subgraph with MatMul and BiasAdd for float32 and bfloat16 types. This pattern is found in models such as BERT-base and BERT-large (see the sketch after this list).
  • Added support for fusing Conv2D + BiasAdd + Sigmoid + Mul and Conv2D + FusedBatchNorm/V2/V3 + Sigmoid + Mul into FusedConv2D for float32 and bfloat16 types, resulting in up to 15% performance improvement for some models (also sketched after this list).
  • Increased the threshold for using the default memory allocator from 4K to 256K, based on internal benchmarking.
  • Added an environment variable that improves the Eigen threadpool’s task-scheduling algorithm when the number of threads equals the number of available CPU cores. This resulted in ~15% throughput improvement for float32 models and ~12% for bfloat16 models.
  • Added bfloat16 registration for Eigen’s FusedBatchNormV2 on CPU to reduce the number of “Cast” operations.
  • Added bfloat16 support for the following binary ops: xdivy, xlogy and xlog1py.
  • Added bfloat16 support for the following 3D pooling ops in Eigen: AveragePool3D, MaxPool3D, AveragePool3DGrad and MaxPool3DGrad.
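
For reference, a sketch of the two fusible subgraphs mentioned above, written as they would appear in user code. This only shows the patterns the graph optimizer looks for; the shapes are illustrative (the MatMul dimensions echo a BERT-style feed-forward layer):

```python
import tensorflow as tf

@tf.function
def dense_gelu(x, w, b):
    # MatMul -> BiasAdd -> Gelu: fusible into a single fused MatMul
    # for float32 and bfloat16 on CPU.
    return tf.nn.gelu(tf.nn.bias_add(tf.matmul(x, w), b), approximate=True)

@tf.function
def conv_swish(x, filters, b):
    # Conv2D -> BiasAdd -> Sigmoid -> Mul: fusible into FusedConv2D.
    y = tf.nn.bias_add(tf.nn.conv2d(x, filters, strides=1, padding="SAME"), b)
    return y * tf.sigmoid(y)

# Illustrative shapes only.
_ = dense_gelu(tf.random.normal([8, 768]),
               tf.random.normal([768, 3072]),
               tf.zeros([3072]))
_ = conv_swish(tf.random.normal([1, 32, 32, 16]),
               tf.random.normal([3, 3, 16, 32]),
               tf.zeros([32]))
```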

Bug fixes:

  • Resolved issues in TensorFlow 2.12.0
  • Resolved issues in oneDNN v2.7.3
  • Fixed issues found during static scan analyses.
  • Fixed a null-pointer exception (NPE) for min/max tensors in QuantizedMatmul.
  • Fixed a bug in AveragePool3DGrad for empty input tensors by mimicking the behavior of the Eigen-based implementation, since this case is not natively supported by oneDNN.
  • Fixed a bug in LayerNorm by adding the missing epsilon attribute.
  • Fixed another bug in LayerNorm by adding the Eigen threadpool interface to the oneDNN stream, so that LayerNorm no longer runs on a single thread.
  • Fixed collective_combine_all_reduce_test_cpu and collective_test_cpu Python unit-test failures on the tensorflow:devel Docker container caused by incompatible NumPy versions.
  • Fixed a bug in the initialization of destination memory in the MatMul primitive.
  • Fixed a bug in the fused batch-matmul op by adding the missing epsilon and leaky-relu alpha attributes.

Versions and components:

Known issues:

  • Bfloat16 is not guaranteed to work on AVX or AVX2 systems.

  • There is a known issue of low accuracy for 3DUnet MLPerf bfloat16 inference; it has been fixed after the TF 2.12 release. As a workaround in TF 2.12, set the following environment variables when running the bfloat16 inference case for this model:

  • TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_INFERLIST_REMOVE=Mean

  • TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_DENYLIST_ADD=Mean
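
A minimal sketch of applying this workaround from Python; the variables must be set before TensorFlow is imported:

```python
import os

# Keep "Mean" out of auto-mixed-precision's infer list and deny it outright,
# working around the 3DUnet bfloat16 accuracy issue in TF 2.12.
os.environ["TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_INFERLIST_REMOVE"] = "Mean"
os.environ["TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_DENYLIST_ADD"] = "Mean"

import tensorflow as tf  # imported after the variables are set
```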
    
  • Intel® Optimized TensorFlow is no longer supported on Windows:

  • To run TensorFlow on Windows, use [official TensorFlow v2.12](https://pypi.org/project/tensorflow/2.12.0/) and set the environment variable TF_ENABLE_ONEDNN_OPTS to 1 (i.e., “set TF_ENABLE_ONEDNN_OPTS=1”). Also, if the PC has hyperthreading enabled, bind the ML application to one logical core per CPU to get the best runtime performance.
    
  • Use the initialization script from the following link to get the best performance on Windows: https://github.com/IntelAI/models/blob/r2.7/benchmarks/common/windows_intel1dnn_setenv.bat