Releases: amd/blis
AOCL-BLAS 5.0
AOCL-BLAS 5.0 Release Highlights
- Added zen5 support
- Turin specific tuning for the APIs: D/ZGEMM, DTRSM and DNRM2
- AVX512 improvements for the APIs: ZGEMV, D/ZAXPYF, D/ZDOTXF, ZDOTV, C/ZSCALV, DNRM2, S/D/ZCOPY, S/D/C/ZAXPBYV, DTRSV, DGEMMT, D/ZTRSM, and D/ZGEMM
- Improvements to the AOCL_ENABLE_INSTRUCTIONS functionality
- Additional APIs and Post-Ops support in addition to the improved performance for the existing APIs in aocl_gemm add-on
AOCL-BLAS 4.2
AOCL-BLAS 4.2 Release Highlights
- Added uint8 output and zero-point support in int8 APIs in aocl_gemm add-on (Low Precision GEMM / LPGEMM)
- Improved performance for all downscaled versions of all APIs in aocl_gemm add-on (Low Precision GEMM / LPGEMM)
- Improved multithread performance across APIs in aocl_gemm add-on (Low Precision GEMM / LPGEMM)
- Introduced AOCL_ENABLE_INSTRUCTIONS environment variable as an alternative to BLIS_ARCH_TYPE, but with slightly different semantics.
- Improved functionality of XERBLA error handling routine in AOCL-BLAS.
- Performance optimizations for the following APIs:
- DGEMM for tiny sizes
- S/ZGEMM, D/ZTRSM, ZAXPBYV, Z/ZDSCALV, S/D/ZGEMV, and D/DZNRM2
- The following BLAS extension APIs have been added only for AMD “Zen” code paths:
- sgemm_pack_get_size(), sgemm_pack(), and sgemm_compute()
- dgemm_pack_get_size(), dgemm_pack(), and dgemm_compute()
AOCL-BLAS 4.1
AOCL-BLAS 4.1 Release Highlights
- Additional APIs and Post-Ops support in addition to the improved performance for the existing APIs in aocl_gemm add-on
- Dynamic dispatch and amdzen configuration support added to aocl_gemm add-on
- Dynamic dispatch feature enhancements.
- AVX 512-based optimizations for AMD “Zen4” platform:
- SGEMM, DGEMM, and ZGEMM
- DTRSM, D/ZAXPY, ZGEMV, DDOTV, and D/ZSCALV
- Improved support for OpenMP nested parallelism.
AOCL-BLIS 4.0
Highlights of AOCL-BLIS 4.0
- The following LPGEMM (Low Precision GEMM) variants are added along with post-ops support:
- aocl_gemm_u8s8s32os32 and aocl_gemm_u8s8s32os8 routines are added and optimized using AVX-512-VNNI
- aocl_gemm_u8s8s16os16 and aocl_gemm_u8s8s16os8 routines are added and optimized using AVX2
- aocl_gemm_bf16bf16f32of32 and aocl_gemm_bf16bf16f32obf16 routines are added and optimized using AVX-512
- SGEMM with packed/reorder buffer support (aocl_gemm_f32f32f32f32)
- AMD “Zen4” support for BLIS
- Dynamic dispatch supports AMD “Zen4” configuration
- Optimizations and performance improvements for DGEMM, SGEMM, ZGEMM, DGEMMT, and DTRSM
- Framework design changes
AOCL-BLIS 3.2
New features:
- Extended BLAS function - DZGEMM
- Progress feature for xGEMM and xTRSM APIs: the time taken to complete these operations grows rapidly with input problem size, so this feature gives users periodic updates on the operation's progress.
- Runtime Threading control using OpenMP APIs
- Dynamic Dispatch covers APUs
- Improved detection of standard x86-64 feature support
- Minor bug fixes
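The idea behind the progress feature can be illustrated with a small, library-independent sketch. It does not use the AOCL-BLIS API: the function name `matmul_with_progress`, the `report` callback, and the `block_rows` parameter are all hypothetical and chosen only for illustration. The sketch shows a long-running matrix multiply that reports back to the caller after each block of rows.

```python
from typing import Callable, List

Matrix = List[List[float]]


def matmul_with_progress(
    a: Matrix,
    b: Matrix,
    report: Callable[[int, int], None],
    block_rows: int = 2,
) -> Matrix:
    """Compute C = A * B, calling report(done_rows, total_rows) per block.

    Hypothetical illustration only; this is not the AOCL-BLIS interface.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for start in range(0, m, block_rows):
        stop = min(start + block_rows, m)
        for i in range(start, stop):
            for j in range(n):
                c[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
        # Periodic update: one callback per finished block of rows.
        report(stop, m)
    return c


updates = []
result = matmul_with_progress(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    lambda done, total: updates.append((done, total)),
)
```

For the actual callback registration function and its arguments, consult the AOCL User Guide; the sketch above only conveys the pattern.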
Performance improvements in the following single-threaded and multi-threaded functions:
- DGEMM, SGEMM, ZGEMM, and CGEMM
- DTRSM, DGEMMT, ZTRSM, CTRSM, and DTRMM
- SGEMV, DHER2, ZTRSV, and DSYMV
- ?AXPBYV, SSCALV, DSCALV, ?DOTXV, and ZAXPY2V
AMD Optimized BLIS Version 3.1
Highlights of improvements on AMD EPYC™ processor family CPUs
- Supports Dynamic Dispatch and AOCL Dynamic feature
- Improvements in DGEMM, ZGEMM, DTRSM, DSYRK, xGEMV, and DOTV
AMD Optimized BLIS Version 3.0.1
Highlights of improvements on AMD EPYC™ processor family CPUs
- Improved performance of DGEMM for skinny matrix shapes.
- Improvements in SGEMM and ZGEMM
- Improved performance of Level-1 and Level-2 BLAS routines, including the GEMV, DOT, and AXPY routines
- Improvements in DTRSM for small matrix sizes
AMD Optimized BLIS Version 3.0
Highlights of improvements on AMD EPYC™ processor family CPUs
- Includes support for AMD’s Zen3 architecture. The build can auto-detect whether it is running on zen3 and enable the features and optimizations specific to that architecture.
- Improved performance of ?dotv, ?gemv, ?axpyv for complex and double complex datatypes
- Includes support for copy transposition routines
- New BLAS extension APIs added including cblas_?cabs1, cblas_i?amin, cblas_?axpby, cblas_?gemm_batch, cblas_?gemm3m
- Debug trace and input logging support added for more BLIS APIs.
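For readers unfamiliar with the extension routines named above, the following pure-Python sketch gives reference semantics for three of them, based on the conventional definitions of these BLAS extensions. It is not the AOCL implementation and omits the stride (`incx`/`incy`) arguments of the real interfaces; the Python function names are hypothetical stand-ins for the `cblas_?cabs1`, `cblas_i?amin`, and `cblas_?axpby` families.

```python
from typing import List, Sequence, Union

Number = Union[float, complex]


def cabs1(z: complex) -> float:
    """|Re(z)| + |Im(z)|, the cheap magnitude used by complex BLAS routines."""
    return abs(z.real) + abs(z.imag)


def iamin(x: Sequence[Number]) -> int:
    """0-based index of the first element with the smallest magnitude.

    For real input, cabs1 reduces to the ordinary absolute value.
    """
    return min(range(len(x)), key=lambda i: cabs1(complex(x[i])))


def axpby(alpha: Number, x: Sequence[Number],
          beta: Number, y: Sequence[Number]) -> List[Number]:
    """Return alpha * x + beta * y elementwise (the real routine updates y in place)."""
    return [alpha * xi + beta * yi for xi, yi in zip(x, y)]


magnitude = cabs1(3 - 4j)
smallest = iamin([3.0, -1.0, 1.0, 2.0])
scaled_sum = axpby(2.0, [1.0, 2.0], 3.0, [10.0, 20.0])
```

Here `axpby` generalizes the classic `axpy` update by also scaling `y`, which is why it is listed as an extension rather than a standard BLAS routine.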
AMD Optimized BLIS Version 2.2
Highlights of improvements on AMD EPYC™ processor family CPUs
- Improved performance for Level-1 BLAS routines for single and double precision.
- Improved performance of SGEMV and DGEMV for large sizes.
- Enabled small unpacked (SUP) GEMM kernels for single-precision complex and double-precision complex (C, Z) GEMM
- Multi-threaded small unpacked (SUP) GEMM kernels enabled for (S, D, C, Z) GEMM, providing improved performance for small/skinny matrices.
- The GEMM selective packing feature is now multithread-enabled. Selective packing packs matrix A, matrix B, or both, and is enabled by setting an environment variable. Refer to the AOCL User Guide at https://developer.amd.com/amd-aocl/ for details.
- Improved TRSM single-thread and multi-thread performance for large and skinny matrices
- Debug trace and log feature enabled for debug purposes.
AMD Optimized BLIS Version 2.1
Highlights of improvements on AMD EPYC™ processor family CPUs
- Improved performance of SGEMM and DGEMM for small and skinny size matrices
- Improved TRSM single-thread performance for small and skinny size matrices
- BLIS build now supports both AMD "zen" and "zen2" configurations with auto config option
- Support for C++ Template APIs for all BLAS functions