ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels #9217

Merged: 4 commits merged into ggerganov:master on Sep 25, 2024

Conversation

chaxu01 (Collaborator) commented Aug 28, 2024

Added a fallback mechanism for cases where the offline re-quantized model is not optimized for the underlying target for the AArch64 GEMV and GEMM kernels. In such situations, the mulmat falls back to a suitable implementation, with the reference implementation as a last resort, instead of triggering an assert. This commit also aligns with an upcoming PR that dynamically detects CPU features.
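As an illustration of the shape of that fallback dispatch, here is a minimal, self-contained sketch; the feature checks stand in for ggml_cpu_has_sve()/ggml_cpu_has_neon(), and the kernel names are hypothetical rather than the actual ggml entry points:

#include <stdio.h>

// Stand-ins for ggml's runtime CPU feature checks, hard-coded so the sketch compiles on its own.
static int cpu_has_sve(void)  { return 0; }
static int cpu_has_neon(void) { return 1; }

// Hypothetical kernel entry points standing in for the optimized GEMV/GEMM paths
// and the reference implementation.
static void gemv_q4_sve(void)       { printf("SVE kernel\n"); }
static void gemv_q4_neon(void)      { printf("NEON kernel\n"); }
static void gemv_q4_reference(void) { printf("reference kernel\n"); }

// Dispatch: use the most specialized kernel the CPU supports and fall back
// to the reference implementation instead of asserting.
static void gemv_q4_dispatch(void) {
    if (cpu_has_sve())  { gemv_q4_sve();  return; }
    if (cpu_has_neon()) { gemv_q4_neon(); return; }
    gemv_q4_reference();
}

int main(void) {
    gemv_q4_dispatch();  // prints "NEON kernel" with the stand-in checks above
    return 0;
}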

The github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Aug 28, 2024
@mofosyne added the Review Complexity : Medium label (generally require more time to grok but manageable by beginner to medium expertise level) on Aug 30, 2024
chaxu01 (Collaborator, Author) commented Sep 2, 2024

Hi, it looks like the checks failed for server-windows, but after reviewing the logs the failure seems unrelated to my changes. Could someone with more context on the server-windows checks take a look? I'm happy to rebase or make any necessary adjustments if needed. Thanks!

Comment on lines 357 to 609
// Print a given message only once
static inline void print_message_once(const char* message) {
    static bool printed = false;
    if (!printed) {
        fprintf(stderr, "%s\n", message);
        printed = true;
    }
}

// Returns the number of 32-bit lanes in the SVE vector if SVE is supported; otherwise returns 0.
static int sve_lane_count(void) {
    static int lane_count = -1;
    if (lane_count == -1) {
#if defined(__ARM_FEATURE_SVE)
        lane_count = ggml_sve_cnt_b;
#else
        lane_count = 0;
#endif
    }
    return lane_count;
}

ggerganov (Owner) commented:

I think these will generate data race warnings when the thread sanitizer is enabled.

chaxu01 (Collaborator, Author) replied:

@ggerganov Thanks for reviewing the code; I'll update it to make the function thread-safe.

chaxu01 (Collaborator, Author) replied:

@ggerganov I have pushed the commit that addresses your concern about the data race.
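A minimal sketch of one way such a print-once helper can be made free of data races, assuming C11 atomics are available (illustrative only, not necessarily what the pushed commit does):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

// Print a given message only once, without racing on the flag:
// an atomic exchange ensures exactly one thread wins and prints.
static inline void print_message_once(const char* message) {
    static atomic_bool printed = false;
    if (!atomic_exchange_explicit(&printed, true, memory_order_relaxed)) {
        fprintf(stderr, "%s\n", message);
    }
}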

chaxu01 force-pushed the feature/matmul-fallback branch from 1c50977 to a53cac3 on September 5, 2024 at 08:27
ggerganov (Owner) left a comment:

I think the fallback changes are good, but the prints are not OK for multiple reasons:

  • Sometimes we might want to run a given quantization type regardless of whether there is a more optimal option, yet there would still be a warning message.
  • The low-level code should not deal with printing. Ideally this logic has to be moved into the user code, letting the application decide if and how to warn the user.
  • There is still a data race on warning_message.

My suggestion is to reduce the PR to just the fallback-related changes.

chaxu01 (Collaborator, Author) commented Sep 11, 2024

@ggerganov thanks for your review. I have pushed a commit based on your suggestions. The patch reduces the PR to only removing the assert and adding the fallback. A separate PR will address the warning messages in the user code, suggesting that a different model could be used for better performance where applicable.
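Purely as an illustration of what such a user-side hint might look like (the helper name is hypothetical; the condition simply mirrors the assert quoted just below, and it assumes the ggml declarations for ggml_cpu_has_sve(), ggml_sve_cnt_b and QK8_0 are in scope):

#include <stdio.h>

// Hypothetical application-level check: warn once, outside the hot path,
// when a more optimal re-quantized format is likely available.
static void warn_if_suboptimal_quant(void) {
    if (ggml_cpu_has_sve() && ggml_sve_cnt_b == QK8_0) {
        fprintf(stderr, "note: SVE is available; the Q4_0_8_8 quantization format may give better performance\n");
    }
}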

GGML_ASSERT(!(ggml_cpu_has_sve() && (ggml_sve_cnt_b == QK8_0)) &&
"__ARM_FEATURE_SVE defined, use the Q4_0_8_8 quantization format for optimal performance");
#if ! ((defined(_MSC_VER)) && ! defined(__clang__)) && defined(__aarch64__) && defined(__ARM_NEON)
if (ggml_cpu_has_neon()) {
ggerganov (Owner) commented Sep 12, 2024:

I'm wondering if these function calls are properly inlined. AFAIK with LTO enabled they should be, but maybe it's better if, instead of relying on the compilation/linker to do it for us, we read the value once into a static variable and check that variable from then on.

The function call overhead is probably negligible, but still, since we are in a hot loop, it might make a difference. What do you think?

chaxu01 (Collaborator, Author) replied:

@ggerganov Thanks for the review and sorry for the late response as I was on vacation. I'll look into your suggestion.

chaxu01 (Collaborator, Author) replied:

One possible way to reduce the overhead of function calls such as ggml_cpu_has_neon() is to cache the results using static variables, as you suggested:

static bool neon_support_checked = false;
static bool neon_supported = false;

To prevent race conditions, the values of neon_support_checked and neon_supported need to be protected by a mutex or atomic operations:

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t neon_check_mutex = PTHREAD_MUTEX_INITIALIZER;

// Thread-safe cached check (assumes ggml_cpu_has_neon() is declared)
static inline bool is_neon_supported(void) {
    pthread_mutex_lock(&neon_check_mutex);

    // Check only if not already checked
    if (!neon_support_checked) {
        neon_supported = ggml_cpu_has_neon();
        neon_support_checked = true;
    }

    pthread_mutex_unlock(&neon_check_mutex);  // Release the lock
    return neon_supported;
}

By marking is_neon_supported() as inline, we may reduce the function call overhead. However, since the function still contains a mutex, the overall performance improvement could be limited due to the cost of acquiring and releasing the lock.

Another possible optimization might be to inline the ggml_cpu_has_neon() function itself, although that might be beyond the scope of this PR.

I'm interested to hear what you think, and I'm happy to consider any suggestions or improvements you may have.

ggerganov (Owner) replied:

Nah, a mutex is overkill. Let's merge it as it is. You can easily check whether the function calls add any overhead by replacing them with if (true). If they do, you can try to find a way to do thread-safe static init without synchronization primitives.
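For reference, such a synchronization-free cached check could look roughly like the sketch below, assuming C11 atomics; since every thread computes the same value, racing initializations are benign and the worst case is a few redundant ggml_cpu_has_neon() calls:

#include <stdatomic.h>
#include <stdbool.h>

// Sketch only: cache a CPU-feature query without a mutex
// (assumes ggml_cpu_has_neon() is declared by the ggml headers).
static inline bool is_neon_supported(void) {
    static atomic_int cached = -1;  // -1 = not yet checked
    int v = atomic_load_explicit(&cached, memory_order_relaxed);
    if (v < 0) {
        v = ggml_cpu_has_neon() ? 1 : 0;
        atomic_store_explicit(&cached, v, memory_order_relaxed);
    }
    return v != 0;
}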

chaxu01 (Collaborator, Author) replied:

I’ve done some overhead tests for the function calls and didn’t observe any statistically significant performance differences. Given that, I’m fine with merging the code as is after I rebase it. Please let me know if you think I should do anything differently.

ggerganov (Owner) replied:

Ok, let's merge after rebase.

* added fallback mechanism when the offline re-quantized model is not optimized for the underlying target.

chaxu01 force-pushed the feature/matmul-fallback branch from 71cf0e1 to 7276e2b on September 25, 2024 at 11:44
ggerganov merged commit 1e43630 into ggerganov:master on Sep 25, 2024
53 checks passed
chaxu01 (Collaborator, Author) commented Sep 25, 2024

@ggerganov Thanks so much for taking the time to review and merge my PR. I really appreciate your feedback and guidance throughout the process. Looking forward to contributing more in the future.

ggerganov (Owner) replied:

@chaxu01 You're welcome, and looking forward to more contributions in the future. We still don't have CI for this code due to a lack of hardware, so the review process is more difficult and takes longer.

dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request on Oct 29, 2024, and arthw pushed commits to arthw/llama.cpp that referenced this pull request on Nov 15 and Nov 18, 2024, each with the same commit message:

ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels (#9217)

* ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels
* added fallback mechanism when the offline re-quantized model is not optimized for the underlying target.
* fix for build errors
* remove prints from the low-level code
* Rebase to the latest upstream