-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable SVE Support for L2 Metric Computation in FP32 #969
base: main
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: adarshs1310 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Welcome @adarshs1310! It looks like this is your first PR to zilliztech/knowhere 🎉 |
@adarshs1310 🔍 Important: PR Classification Needed! For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:
For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”. Thanks for your efforts and contribution to the community!. |
1783ca6
to
4186e1a
Compare
e304e26
to
fe28988
Compare
svbool_t pg = svptrue_b32(); | ||
|
||
while (i < d) { | ||
if (d - i < svcntw()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that this if
condition is not needed, just pg = svwhilelt_b32(i, d);
should be sufficient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much, @alexanderguzhva , for the valuable suggestions. During development, we considered this approach as well, and our reason for going with the current approach is as follows:
Using the if condition to update pg only in the last iteration avoids unnecessary updates and reduces the dependency chain introduced by the svwhilelt instruction. This optimization minimizes stalls caused by these dependencies, allowing the processor pipeline to operate more efficiently.
cmake/libs/libfaiss.cmake
Outdated
@@ -48,7 +48,7 @@ endif() | |||
|
|||
if(__AARCH64) | |||
set(UTILS_SRC src/simd/hook.cc src/simd/distances_ref.cc | |||
src/simd/distances_neon.cc) | |||
src/simd/distances_neon.cc src/simd/distances_sve.cc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that this is not sufficient.
Knowhere is designed as a library that picks function pointers according to CPU capabilities, detected upon the start.
For example, SSE / AVX2 / AVX512 code files have different corresponding compile options
knowhere/cmake/libs/libfaiss.cmake
Lines 37 to 40 in 1cb9937
target_compile_options(utils_sse PRIVATE -msse4.2 -mpopcnt) | |
target_compile_options(utils_avx PRIVATE -mfma -mf16c -mavx2 -mpopcnt) | |
target_compile_options(utils_avx512 PRIVATE -mfma -mf16c -mavx512f -mavx512dq | |
-mavx512bw -mpopcnt -mavx512vl) |
So, it seems to be logical that the new SVE code should also contain some form of different flags, such as -march=armv8-a+sve
for distances_sve.cc
. And I don't believe that I see this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your valuable feedback @alexanderguzhva
We considered two approaches: first, to keep Neon as the default until all FP32 functions for SVE have been fully implemented. Based on this, we decided to proceed with this approach.
Your approach is absolutely correct, and we agree that the Neon fallback for non-SVE functions can be effectively handled using hook.cc.
We have incorporated these changes into our solution.
97b3ee0
to
dde2c81
Compare
@alexanderguzhva @foxspy @hhy3 Requesting updates to the CI pipeline to address GCC header file conflicts. Since Ubuntu 22.04 defaults to GCC-11, there is a risk of incorrect header usage when GCC-12 is installed. To resolve this, the GCC-11 header files should be removed after installing GCC-12. The necessary changes have been incorporated into the ARM-based Ubuntu 22.04 Docker configuration as part of this PR, ensuring proper SVE compatibility. Kindly review and consider these adjustments for consistent and accurate builds. Thank you! |
/kind feature |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #969 +/- ##
=========================================
+ Coverage 0 73.01% +73.01%
=========================================
Files 0 82 +82
Lines 0 7507 +7507
=========================================
+ Hits 0 5481 +5481
- Misses 0 2026 +2026 |
dde2c81
to
9276ee8
Compare
@adarshs1310 would you please rebase on top of master? it contains a fix for UT. Thanks. |
9276ee8
to
a626265
Compare
Sure @alexanderguzhva! The rebase has been done now! |
Hi @alexanderguzhva , @foxspy , @hhy3 , Could you please look into the SSE fix when you have a chance? The ARM CI issue has already been handled in our code. Thank you so much for your support! |
SSE's CI will not block; but ARM's CI fails |
@adarshs1310 I was able to compile and run knowhere on AWS Graviton 3 using GCC-12. I see the following error during unit tests, which seems to be a minor precision issue. Still, would you be able to find the root of this problem?
Other unit tests pass. The compilation is the following (given that you have created a profile for GCC 12 for conan): mkdir build
cd build
conan install .. --build=missing -o with_diskann=True -o with_ut=True -o with_benchmark=True -s compiler.libcxx=libstdc++11 -c tools.build:cxxflags+=[\"-mcpu=neoverse-512tvb\",\"-march=native\"] -s build_type=Release --profile=gcc12
conan build .. |
704321d
to
e6abc41
Compare
Rebase has been done |
src/simd/distances_sve.cc
Outdated
#include <cmath> | ||
|
||
#include "faiss/impl/platform_macros.h" | ||
#pragma GCC optimize("O3,fast-math,inline") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adarshs1310 Could you please just remove this hacky pragma?
Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure @alexanderguzhva I have removed it. Thanks!
fba1adb
to
09925c7
Compare
lgtm |
At this stage, I am not inclined to accept this PR, as it would require us to maintain two sets of ARM binaries when releasing binaries or containers. At this stage, I am not inclined to accept this PR, as it would require us to maintain two sets of ARM binaries when releasing binaries or containers. |
09925c7
to
59f6ebf
Compare
554a685
to
d6555e0
Compare
Thank you, @Presburger , for sharing this! We have now implemented a supports_sve() function that dynamically checks if the machine supports SVE at runtime. This approach eliminates the need to maintain separate binaries for ARM architectures. |
Hi @Presburger @foxspy @alexanderguzhva @hhy3! I am following up regarding the recent updates to this PR. As noted, we have implemented the supports_sve() function to dynamically verify SVE support at runtime hereby removing the need to maintain two binaries for arm. In addition, we have been making further progress and remain committed to improving the Knowhere ecosystem, particularly by optimizing performance. We will continue to contribute actively to this project. Could you kindly review the latest changes and let us know if this is ready to proceed? Your feedback and guidance are highly valued. |
d6555e0
to
35757c9
Compare
@adarshs1310 Thank you very much for your contribution. Could you please test on a CPU that does not support SVE to see if there are any illegal instruction issues? Additionally, when dynamic selection is supported, is the performance improvement still significant? Thank you very much for your work. |
Signed-off-by:Adarsh Srivastava <[email protected]> Signed-off-by: Adarsh Srivastava <[email protected]>
0328ad1
to
da98fcf
Compare
Hi @Presburger! Thank you for your feedback and for pointing this out. We’ve tested the latest commit on a CPU that does not support SVE, and there are no illegal instruction issues on the NEON machine (m6g.16xlarge). Additionally, there is still a significant performance boost even when dynamic selection is supported. All test cases are passing successfully on the m7g.16xlarge instance. However, on the main branch, in the test_simd function with the disabled BF16 patch, the threshold for validation is very strict, occasionally leading to test failures with a value difference of less than 0.1. It might be worth revisiting the tolerance in this case. Your feedback is highly valued, and we look forward to continuing our contributions together. |
Hi @Presburger @alexanderguzhva @foxspy @hhy3 , Just following up to check if our PR looks good to you or if there’s anything that needs to be addressed. |
fvec_inner_product_batch_4 = fvec_inner_product_batch_4_neon; | ||
fvec_L2sqr_batch_4 = fvec_L2sqr_batch_4_neon; | ||
|
||
fp16_vec_inner_product_batch_4 = fp16_vec_inner_product_batch_4_neon; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, please also implement the batch4 for the IP sve.
svbool_t pg = svptrue_b32(); | ||
|
||
while (i < n) { | ||
if (n - i < svcntw()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The handling of the tail block here could be more elegant, as SVE's predicate registers are inherently friendly to out-of-bounds memory access.
|
||
set(UTILS_SRC src/simd/distances_ref.cc src/simd/distances_neon.cc) | ||
set(UTILS_SVE_SRC src/simd/hook.cc src/simd/distances_sve.cc) | ||
set(ALL_UTILS_SRC ${UTILS_SRC} ${UTILS_SVE_SRC}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can directly place the SVE source file into the UTILS_SRC variable.
Description:
This PR introduces SVE (Scalable Vector Extension) enablement for L2 metric computation in FP32. The changes enhance performance for most indexing methods compared to NEON, with observed speed-ups across multiple algorithms.
Changes in This PR:
Benchmark Results:
Performance benchmarks(32 vcpus) were conducted using both NEON and SVE on ARM architecture. Below are the results showcasing execution times (in seconds):
Key observations:
Note: SVE support has been made optional, as not all functions have been fully enabled yet. To utilize SVE, please compile with -march=armv8-a+sve.
/kind feature
Fixes #782