Enable AMX in hnswlib for FP32 innerproduct. #609
Hi reviewers, Intel Advanced Matrix Extensions (AMX) is a set of specialized instructions designed to accelerate matrix operations, which are fundamental to many areas of modern computing such as machine learning, scientific computing, and graphics processing. AMX uses a dedicated tile-based architecture within the CPU to perform these operations more efficiently than traditional scalar or SIMD (Single Instruction, Multiple Data) methods.
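For context, using AMX requires a one-time permission request to the Linux kernel plus a tile-shape configuration before any tile instruction runs. Below is a minimal sketch of that setup; the `TileConfig` layout follows the palette-1 format from the Intel SDM, while the specific tile shapes are chosen to match the inner-product sketch further down, not necessarily what this patch configures:

```cpp
#include <immintrin.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>

// Linux opt-in constants for the AMX tile state (see arch_prctl(2)).
#define ARCH_REQ_XCOMP_PERM 0x1023
#define XFEATURE_XTILEDATA  18

// Palette-1 tile configuration block; must be 64 bytes and 64-byte aligned.
struct alignas(64) TileConfig {
    uint8_t  palette_id;   // 1 selects the standard tile palette
    uint8_t  start_row;    // restart row for interrupted loads; 0 here
    uint8_t  reserved[14];
    uint16_t colsb[16];    // bytes per row for each tile
    uint8_t  rows[16];     // number of rows for each tile
};

// Compile with -mamx-tile (GCC 11+ / Clang 12+).
bool init_amx() {
    // The kernel keeps the large AMX register state disabled by default;
    // each process must request it once before executing tile instructions.
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) != 0)
        return false;

    TileConfig cfg{};                     // zero-initialize all tiles
    cfg.palette_id = 1;
    cfg.rows[0] = 1;  cfg.colsb[0] = 4;   // tile 0: one FP32 accumulator
    cfg.rows[1] = 1;  cfg.colsb[1] = 64;  // tile 1: 32 BF16 values of a
    cfg.rows[2] = 16; cfg.colsb[2] = 4;   // tile 2: 16 BF16 pairs of b
    _tile_loadconfig(&cfg);
    return true;
}
```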
During our analysis of vector search performance in hnswlib, we identified the SIMD-based inner-product (IP) distance computation as a significant bottleneck. Recognizing the potential for optimization with AMX, we replaced the SIMD kernel with an AMX-based one (sketched below) and measured a 1.08x performance gain, which prompted us to submit this patch.
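To make the idea concrete, here is a sketch of a single-pair BF16 inner product via the `TDPBF16PS` tile instruction, assuming the tile shapes configured above and a dimension that is a multiple of 32; it illustrates the technique rather than reproducing the exact kernel in this patch. Because the B tile expects one BF16 pair per row, loading `b` with a 4-byte stride lets its natural contiguous layout supply that format:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Compile with -mamx-tile -mamx-bf16. `a` and `b` hold BF16 values as raw
// uint16_t; dim must be a multiple of 32 for this simplified version.
float amx_inner_product_bf16(const uint16_t* a, const uint16_t* b, size_t dim) {
    float result = 0.0f;
    _tile_zero(0);                     // clear the FP32 accumulator tile
    for (size_t i = 0; i < dim; i += 32) {
        _tile_loadd(1, a + i, 64);     // one row: a[i..i+31]
        _tile_loadd(2, b + i, 4);      // 16 rows x 1 pair: b[i..i+31]
        // C[0][0] += sum over 16 pairs, i.e. 32 products a[i+j]*b[i+j]
        _tile_dpbf16ps(0, 1, 2);
    }
    _tile_stored(0, &result, 4);       // spill the single FP32 result
    return result;
}
```

A real integration would batch several candidate vectors into full 16-row tiles to amortize the tile loads; the 1x1 form above is only meant to show the data layout and accumulation pattern.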
It is worth noting that AMX exclusively supports FP16, BF16, and INT8 data types. Consequently, in this patch, we perform a conversion from FP32 to BF16 prior to each computation. Furthermore, we have developed an alternative patch that eliminates the need for data type conversion by using BF16 for both storage and computation, leading to a more substantial performance improvement of 1.78x. We are prepared to submit a new pull request for this alternative approach if it is deemed appropriate.
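For reference, the FP32 to BF16 conversion itself is cheap: BF16 keeps the sign bit, the full 8-bit exponent, and the top 7 mantissa bits of an FP32 value, so converting is a rounded right shift by 16. Below is a minimal scalar sketch with round-to-nearest-even (NaN payloads not special-cased; this is illustrative rather than the exact routine in this patch, and with `-mavx512bf16` the same work can be done 16 floats at a time via `_mm512_cvtneps_pbh`):

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>

// Round-to-nearest-even FP32 -> BF16 conversion of a single value.
static inline uint16_t fp32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    bits += 0x7FFFu + ((bits >> 16) & 1u);  // bias by half-ULP, ties to even
    return static_cast<uint16_t>(bits >> 16);
}

void convert_fp32_to_bf16(const float* src, uint16_t* dst, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = fp32_to_bf16(src[i]);
}
```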
All tests were performed on an Intel(R) Xeon(R) 6980P processor using 16 physical cores and 16 threads. The benchmark dataset consisted of 1000 vectors of 1024 dimensions each.