ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels #9217

Merged: 4 commits merged into ggerganov:master on Sep 25, 2024

Conversation

chaxu01 (Collaborator) commented Aug 28, 2024

Added a fallback mechanism for cases where the offline re-quantized model is not optimized for the underlying target for the AArch64 GEMV and GEMM kernels. In such situations, the mulmat falls back to a suitable implementation, with the reference implementation as a last resort, instead of triggering an assert. This commit also aligns with an upcoming PR that dynamically detects CPU features.
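As an illustration of the shape of that fallback dispatch, here is a minimal, self-contained sketch; the feature checks stand in for ggml_cpu_has_sve()/ggml_cpu_has_neon(), and the kernel names are hypothetical rather than the actual ggml entry points:

#include <stdio.h>

// Stand-ins for ggml's runtime CPU feature checks, hard-coded so the sketch compiles on its own.
static int cpu_has_sve(void)  { return 0; }
static int cpu_has_neon(void) { return 1; }

// Hypothetical kernel entry points standing in for the optimized GEMV/GEMM paths
// and the reference implementation.
static void gemv_q4_sve(void)       { printf("SVE kernel\n"); }
static void gemv_q4_neon(void)      { printf("NEON kernel\n"); }
static void gemv_q4_reference(void) { printf("reference kernel\n"); }

// Dispatch: use the most specialized kernel the CPU supports and fall back
// to the reference implementation instead of asserting.
static void gemv_q4_dispatch(void) {
    if (cpu_has_sve())  { gemv_q4_sve();  return; }
    if (cpu_has_neon()) { gemv_q4_neon(); return; }
    gemv_q4_reference();
}

int main(void) {
    gemv_q4_dispatch();  // prints "NEON kernel" with the stand-in checks above
    return 0;
}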

The github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Aug 28, 2024
@mofosyne added the Review Complexity : Medium label (generally require more time to grok but manageable by beginner to medium expertise level) on Aug 30, 2024
chaxu01 (Collaborator, Author) commented Sep 2, 2024

Hi, it looks like the checks failed for server-windows, but after reviewing the logs the failure seems unrelated to my changes. Could someone with more context on the server-windows checks take a look? I'm happy to rebase or make any necessary adjustments if needed. Thanks!

Comment on lines 357 to 609
// Print a given message only once
static inline void print_message_once(const char* message) {
    static bool printed = false;
    if (!printed) {
        fprintf(stderr, "%s\n", message);
        printed = true;
    }
}

// Returns the number of 32-bit lanes in the SVE vector if SVE is supported; otherwise returns 0.
static int sve_lane_count(void) {
    static int lane_count = -1;
    if (lane_count == -1) {
#if defined(__ARM_FEATURE_SVE)
        lane_count = ggml_sve_cnt_b;
#else
        lane_count = 0;
#endif
    }
    return lane_count;
}

ggerganov (Owner) commented:

I think these will generate data race warnings when the thread sanitizer is enabled.

chaxu01 (Collaborator, Author) replied:

@ggerganov Thanks for reviewing the code; I'll update it to make the function thread-safe.

chaxu01 (Collaborator, Author) replied:

@ggerganov I have pushed the commit that addresses your concern about the data race.
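A minimal sketch of one way such a print-once helper can be made free of data races, assuming C11 atomics are available (illustrative only, not necessarily what the pushed commit does):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

// Print a given message only once, without racing on the flag:
// an atomic exchange ensures exactly one thread wins and prints.
static inline void print_message_once(const char* message) {
    static atomic_bool printed = false;
    if (!atomic_exchange_explicit(&printed, true, memory_order_relaxed)) {
        fprintf(stderr, "%s\n", message);
    }
}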

chaxu01 force-pushed the feature/matmul-fallback branch from 1c50977 to a53cac3 on September 5, 2024 at 08:27
ggerganov (Owner) left a comment:

I think the fallback changes are good, but the prints are not OK for multiple reasons:

  • Sometimes we might want to run a given quantization type regardless of whether there is a more optimal option, yet there would still be a warning message.
  • The low-level code should not deal with printing. Ideally this logic has to be moved into the user code, letting the application decide if and how to warn the user.
  • There is still a data race on warning_message.

My suggestion is to reduce the PR to just the fallback-related changes.

chaxu01 (Collaborator, Author) commented Sep 11, 2024

@ggerganov thanks for your review. I have pushed a commit based on your suggestions. The patch reduces the PR to only removing the assert and adding the fallback. A separate PR will address the warning messages in the user code, suggesting that a different model could be used for better performance where applicable.
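Purely as an illustration of what such a user-side hint might look like (the helper name is hypothetical; the condition simply mirrors the assert quoted just below, and it assumes the ggml declarations for ggml_cpu_has_sve(), ggml_sve_cnt_b and QK8_0 are in scope):

#include <stdio.h>

// Hypothetical application-level check: warn once, outside the hot path,
// when a more optimal re-quantized format is likely available.
static void warn_if_suboptimal_quant(void) {
    if (ggml_cpu_has_sve() && ggml_sve_cnt_b == QK8_0) {
        fprintf(stderr, "note: SVE is available; the Q4_0_8_8 quantization format may give better performance\n");
    }
}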

GGML_ASSERT(!(ggml_cpu_has_sve() && (ggml_sve_cnt_b == QK8_0)) &&
"__ARM_FEATURE_SVE defined, use the Q4_0_8_8 quantization format for optimal performance");
#if ! ((defined(_MSC_VER)) && ! defined(__clang__)) && defined(__aarch64__) && defined(__ARM_NEON)
if (ggml_cpu_has_neon()) {
ggerganov (Owner) commented Sep 12, 2024:

I'm wondering if these function calls are properly inlined. AFAIK with LTO enabled they should be, but maybe it's better if, instead of relying on the compilation/linker to do it for us, we read the value once into a static variable and check that variable from then on.

The function call overhead is probably negligible, but still, since we are in a hot loop, it might make a difference. What do you think?

chaxu01 (Collaborator, Author) replied:

@ggerganov Thanks for the review and sorry for the late response as I was on vacation. I'll look into your suggestion.

chaxu01 (Collaborator, Author) replied:

One possible way to reduce the overhead of function calls such as ggml_cpu_has_neon() is to cache the results using static variables, as you suggested:

static bool neon_support_checked = false;
static bool neon_supported = false;

To prevent race conditions, the values of neon_support_checked and neon_supported need to be protected by a mutex or atomic operations:

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t neon_check_mutex = PTHREAD_MUTEX_INITIALIZER;

// Thread-safe cached check (assumes ggml_cpu_has_neon() is declared)
static inline bool is_neon_supported(void) {
    pthread_mutex_lock(&neon_check_mutex);

    // Check only if not already checked
    if (!neon_support_checked) {
        neon_supported = ggml_cpu_has_neon();
        neon_support_checked = true;
    }

    pthread_mutex_unlock(&neon_check_mutex);  // Release the lock
    return neon_supported;
}

By marking is_neon_supported() as inline, we may reduce the function call overhead. However, since the function still contains a mutex, the overall performance improvement could be limited due to the cost of acquiring and releasing the lock.

Another possible optimization might be to inline the ggml_cpu_has_neon() function itself, although that might be beyond the scope of this PR.

I'm interested to hear what you think, and I'm happy to consider any suggestions or improvements you may have.

ggerganov (Owner) replied:

Nah, a mutex is overkill. Let's merge it as it is. You can easily check whether the function calls add any overhead by replacing them with if (true). If they do, you can try to find a way to do thread-safe static init without synchronization primitives.
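For reference, such a synchronization-free cached check could look roughly like the sketch below, assuming C11 atomics; since every thread computes the same value, racing initializations are benign and the worst case is a few redundant ggml_cpu_has_neon() calls:

#include <stdatomic.h>
#include <stdbool.h>

// Sketch only: cache a CPU-feature query without a mutex
// (assumes ggml_cpu_has_neon() is declared by the ggml headers).
static inline bool is_neon_supported(void) {
    static atomic_int cached = -1;  // -1 = not yet checked
    int v = atomic_load_explicit(&cached, memory_order_relaxed);
    if (v < 0) {
        v = ggml_cpu_has_neon() ? 1 : 0;
        atomic_store_explicit(&cached, v, memory_order_relaxed);
    }
    return v != 0;
}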

chaxu01 (Collaborator, Author) replied:

I’ve done some overhead tests for the function calls and didn’t observe any statistically significant performance differences. Given that, I’m fine with merging the code as is after I rebase it. Please let me know if you think I should do anything differently.

ggerganov (Owner) replied:

Ok, let's merge after rebase.

* added fallback mechanism when the offline re-quantized model is not optimized for the underlying target.

chaxu01 force-pushed the feature/matmul-fallback branch from 71cf0e1 to 7276e2b on September 25, 2024 at 11:44
ggerganov merged commit 1e43630 into ggerganov:master on Sep 25, 2024
53 checks passed
chaxu01 (Collaborator, Author) commented Sep 25, 2024

@ggerganov Thanks so much for taking the time to review and merge my PR. I really appreciate your feedback and guidance throughout the process. Looking forward to contributing more in the future.

ggerganov (Owner) replied:

@chaxu01 You're welcome, and looking forward to more contributions in the future. We still don't have CI for this code due to a lack of hardware, so the review process is more difficult and takes longer.

dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request on Oct 29, 2024, and arthw pushed commits to arthw/llama.cpp that referenced this pull request on Nov 15 and Nov 18, 2024, each with the same commit message:

ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels (#9217)

* ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels
* added fallback mechanism when the offline re-quantized model is not optimized for the underlying target.
* fix for build errors
* remove prints from the low-level code
* Rebase to the latest upstream