Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable SVE Support for L2 Metric Computation in FP32 #969

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

adarshs1310
Copy link

@adarshs1310 adarshs1310 commented Nov 29, 2024

Description:
This PR introduces SVE (Scalable Vector Extension) enablement for L2 metric computation in FP32. The changes enhance performance for most indexing methods compared to NEON, with observed speed-ups across multiple algorithms.

Changes in This PR:

  • Added SVE optimizations for L2 metric computation in FP32.
  • Updated CMakeLists.txt to include support for -march=armv8-a+sve.
  • Refactored compute kernels to leverage SVE intrinsics.

Benchmark Results:

Performance benchmarks(32 vcpus) were conducted using both NEON and SVE on ARM architecture. Below are the results showcasing execution times (in seconds):

image

Key observations:

  • All of the algorithms exhibit performance gains with SVE when compared with NEON.

Note: SVE support has been made optional, as not all functions have been fully enabled yet. To utilize SVE, please compile with -march=armv8-a+sve.

/kind feature
Fixes #782

@sre-ci-robot sre-ci-robot requested review from foxspy and hhy3 November 29, 2024 10:34
@sre-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: adarshs1310
To complete the pull request process, please assign zhengbuqian after the PR has been reviewed.
You can assign the PR to them by writing /assign @zhengbuqian in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot
Copy link
Collaborator

Welcome @adarshs1310! It looks like this is your first PR to zilliztech/knowhere 🎉

Copy link

mergify bot commented Nov 29, 2024

@adarshs1310 🔍 Important: PR Classification Needed!

For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:

  1. If you're fixing a bug, label it as kind/bug.
  2. For small tweaks (less than 20 lines without altering any functionality), please use kind/improvement.
  3. Significant changes that don't modify existing functionalities should be tagged as kind/enhancement.
  4. Adjusting APIs or changing functionality? Go with kind/feature.

For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”.

Thanks for your efforts and contribution to the community!.

@adarshs1310 adarshs1310 force-pushed the sve-l2-fp32 branch 2 times, most recently from 1783ca6 to 4186e1a Compare November 29, 2024 11:31
@mergify mergify bot added dco-passed and removed needs-dco labels Nov 29, 2024
@adarshs1310 adarshs1310 force-pushed the sve-l2-fp32 branch 2 times, most recently from e304e26 to fe28988 Compare November 29, 2024 12:29
svbool_t pg = svptrue_b32();

while (i < d) {
if (d - i < svcntw())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this if condition is not needed, just pg = svwhilelt_b32(i, d); should be sufficient

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much, @alexanderguzhva , for the valuable suggestions. During development, we considered this approach as well, and our reason for going with the current approach is as follows:

Using the if condition to update pg only in the last iteration avoids unnecessary updates and reduces the dependency chain introduced by the svwhilelt instruction. This optimization minimizes stalls caused by these dependencies, allowing the processor pipeline to operate more efficiently.

@@ -48,7 +48,7 @@ endif()

if(__AARCH64)
set(UTILS_SRC src/simd/hook.cc src/simd/distances_ref.cc
src/simd/distances_neon.cc)
src/simd/distances_neon.cc src/simd/distances_sve.cc)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this is not sufficient.
Knowhere is designed as a library that picks function pointers according to CPU capabilities, detected upon the start.
For example, SSE / AVX2 / AVX512 code files have different corresponding compile options

target_compile_options(utils_sse PRIVATE -msse4.2 -mpopcnt)
target_compile_options(utils_avx PRIVATE -mfma -mf16c -mavx2 -mpopcnt)
target_compile_options(utils_avx512 PRIVATE -mfma -mf16c -mavx512f -mavx512dq
-mavx512bw -mpopcnt -mavx512vl)

So, it seems to be logical that the new SVE code should also contain some form of different flags, such as -march=armv8-a+sve for distances_sve.cc. And I don't believe that I see this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your valuable feedback @alexanderguzhva

We considered two approaches: first, to keep Neon as the default until all FP32 functions for SVE have been fully implemented. Based on this, we decided to proceed with this approach.

Your approach is absolutely correct, and we agree that the Neon fallback for non-SVE functions can be effectively handled using hook.cc.

We have incorporated these changes into our solution.

@adarshs1310 adarshs1310 force-pushed the sve-l2-fp32 branch 2 times, most recently from 97b3ee0 to dde2c81 Compare November 30, 2024 07:12
@adarshs1310
Copy link
Author

@alexanderguzhva @foxspy @hhy3

Requesting updates to the CI pipeline to address GCC header file conflicts.

Since Ubuntu 22.04 defaults to GCC-11, there is a risk of incorrect header usage when GCC-12 is installed. To resolve this, the GCC-11 header files should be removed after installing GCC-12.

The necessary changes have been incorporated into the ARM-based Ubuntu 22.04 Docker configuration as part of this PR, ensuring proper SVE compatibility. Kindly review and consider these adjustments for consistent and accurate builds. Thank you!

@adarshs1310
Copy link
Author

/kind feature

Copy link

codecov bot commented Dec 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.01%. Comparing base (3c46f4c) to head (da98fcf).
Report is 297 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff            @@
##           main     #969       +/-   ##
=========================================
+ Coverage      0   73.01%   +73.01%     
=========================================
  Files         0       82       +82     
  Lines         0     7507     +7507     
=========================================
+ Hits          0     5481     +5481     
- Misses        0     2026     +2026     

see 82 files with indirect coverage changes

@alexanderguzhva
Copy link
Collaborator

@adarshs1310 would you please rebase on top of master? it contains a fix for UT. Thanks.

@adarshs1310
Copy link
Author

Sure @alexanderguzhva! The rebase has been done now!

@adarshs1310
Copy link
Author

Hi @alexanderguzhva , @foxspy , @hhy3 ,

Could you please look into the SSE fix when you have a chance? The ARM CI issue has already been handled in our code.

Thank you so much for your support!

@foxspy
Copy link
Collaborator

foxspy commented Dec 9, 2024

Hi @alexanderguzhva , @foxspy , @hhy3 ,

Could you please look into the SSE fix when you have a chance? The ARM CI issue has already been handled in our code.

Thank you so much for your support!

SSE's CI will not block; but ARM's CI fails

@alexanderguzhva
Copy link
Collaborator

alexanderguzhva commented Dec 9, 2024

@adarshs1310 I was able to compile and run knowhere on AWS Graviton 3 using GCC-12. I see the following error during unit tests, which seems to be a minor precision issue. Still, would you be able to find the root of this problem?

-------------------------------------------------------------------------------
Test Brute Force with input ids
-------------------------------------------------------------------------------
/home/ubuntu/zilliz/knowhere_sve/knowhere/tests/ut/test_bruteforce.cc:201
...............................................................................

/home/ubuntu/zilliz/knowhere_sve/knowhere/tests/ut/test_bruteforce.cc:243: FAILED:
  REQUIRE( gt_dis[i] == dis[i] )
with expansion:
  156769.3125f == 156769.32812f

Other unit tests pass.

The compilation is the following (given that you have created a profile for GCC 12 for conan):

mkdir build
cd build
conan install .. --build=missing -o with_diskann=True -o with_ut=True -o with_benchmark=True -s compiler.libcxx=libstdc++11 -c tools.build:cxxflags+=[\"-mcpu=neoverse-512tvb\",\"-march=native\"] -s build_type=Release --profile=gcc12
conan build ..

@adarshs1310
Copy link
Author

adarshs1310 commented Dec 17, 2024

Rebase has been done

@foxspy @alexanderguzhva @hhy3

#include <cmath>

#include "faiss/impl/platform_macros.h"
#pragma GCC optimize("O3,fast-math,inline")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adarshs1310 Could you please just remove this hacky pragma?
Thanks.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure @alexanderguzhva I have removed it. Thanks!

@adarshs1310 adarshs1310 force-pushed the sve-l2-fp32 branch 2 times, most recently from fba1adb to 09925c7 Compare December 17, 2024 15:39
@mergify mergify bot added the ci-passed label Dec 17, 2024
@alexanderguzhva
Copy link
Collaborator

lgtm

@adarshs1310
Copy link
Author

Hi @foxspy and @hhy3! Can we Please get an update on this?

Thank you so much!

@Presburger
Copy link
Collaborator

At this stage, I am not inclined to accept this PR, as it would require us to maintain two sets of ARM binaries when releasing binaries or containers. At this stage, I am not inclined to accept this PR, as it would require us to maintain two sets of ARM binaries when releasing binaries or containers.

@mergify mergify bot removed the ci-passed label Jan 6, 2025
@adarshs1310 adarshs1310 force-pushed the sve-l2-fp32 branch 2 times, most recently from 554a685 to d6555e0 Compare January 6, 2025 07:20
@adarshs1310
Copy link
Author

At this stage, I am not inclined to accept this PR, as it would require us to maintain two sets of ARM binaries when releasing binaries or containers. At this stage, I am not inclined to accept this PR, as it would require us to maintain two sets of ARM binaries when releasing binaries or containers.

Thank you, @Presburger , for sharing this! We have now implemented a supports_sve() function that dynamically checks if the machine supports SVE at runtime. This approach eliminates the need to maintain separate binaries for ARM architectures.

@adarshs1310
Copy link
Author

Hi @Presburger @foxspy @alexanderguzhva @hhy3!

I am following up regarding the recent updates to this PR. As noted, we have implemented the supports_sve() function to dynamically verify SVE support at runtime hereby removing the need to maintain two binaries for arm.

In addition, we have been making further progress and remain committed to improving the Knowhere ecosystem, particularly by optimizing performance. We will continue to contribute actively to this project.

Could you kindly review the latest changes and let us know if this is ready to proceed?

Your feedback and guidance are highly valued.

@Presburger
Copy link
Collaborator

@adarshs1310 Thank you very much for your contribution. Could you please test on a CPU that does not support SVE to see if there are any illegal instruction issues? Additionally, when dynamic selection is supported, is the performance improvement still significant? Thank you very much for your work.

Signed-off-by:Adarsh Srivastava <[email protected]>
Signed-off-by: Adarsh Srivastava <[email protected]>
@adarshs1310
Copy link
Author

image

Hi @Presburger!

Thank you for your feedback and for pointing this out. We’ve tested the latest commit on a CPU that does not support SVE, and there are no illegal instruction issues on the NEON machine (m6g.16xlarge). Additionally, there is still a significant performance boost even when dynamic selection is supported. All test cases are passing successfully on the m7g.16xlarge instance.

However, on the main branch, in the test_simd function with the disabled BF16 patch, the threshold for validation is very strict, occasionally leading to test failures with a value difference of less than 0.1. It might be worth revisiting the tolerance in this case.

Your feedback is highly valued, and we look forward to continuing our contributions together.

@mergify mergify bot added the ci-passed label Jan 20, 2025
@adarshs1310
Copy link
Author

Hi @Presburger @alexanderguzhva @foxspy @hhy3 ,

Just following up to check if our PR looks good to you or if there’s anything that needs to be addressed.

fvec_inner_product_batch_4 = fvec_inner_product_batch_4_neon;
fvec_L2sqr_batch_4 = fvec_L2sqr_batch_4_neon;

fp16_vec_inner_product_batch_4 = fp16_vec_inner_product_batch_4_neon;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, please also implement the batch4 for the IP sve.

svbool_t pg = svptrue_b32();

while (i < n) {
if (n - i < svcntw())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The handling of the tail block here could be more elegant, as SVE's predicate registers are inherently friendly to out-of-bounds memory access.


set(UTILS_SRC src/simd/distances_ref.cc src/simd/distances_neon.cc)
set(UTILS_SVE_SRC src/simd/hook.cc src/simd/distances_sve.cc)
set(ALL_UTILS_SRC ${UTILS_SRC} ${UTILS_SVE_SRC})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can directly place the SVE source file into the UTILS_SRC variable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support SVE
6 participants