Bug: Unintended behavior in llama_decode_internal when cparams.embeddings is True and cparams.pooling_type is LLAMA_POOLING_TYPE_NONE #8956

Closed
hitzelc opened this issue Aug 9, 2024 · 4 comments · Fixed by #8981
Labels
bug: Something isn't working
low severity: Used to report low severity bugs in llama.cpp (e.g. cosmetic issues, non critical UI glitches)

Comments

hitzelc commented Aug 9, 2024

In llama_decode_internal, if cparams.embeddings is true but cparams.pooling_type is LLAMA_POOLING_TYPE_NONE (i.e. 0), the following always results in a failed assertion:

llama.cpp/src/llama.cpp, lines 14619 to 14625 in 6afd1a9:

} else if (cparams.embeddings) {
res = nullptr; // do not extract logits for embedding case
embd = gf->nodes[gf->n_nodes - 1];
if (strcmp(embd->name, "result_embd_pooled") != 0) {
embd = gf->nodes[gf->n_nodes - 2];
}
GGML_ASSERT(strcmp(embd->name, "result_embd_pooled") == 0 && "missing embeddings tensor");

I noticed this bool:
const bool embd_pooled = cparams.embeddings && cparams.pooling_type != LLAMA_POOLING_TYPE_NONE;
is set just a few lines above, so I'm wondering whether there is some intended behavior for the case where no pooling strategy is selected.

Or should there simply be a warning here, or a default pooling strategy forced?
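
For reference, here is a minimal sketch (not from the original report; the model path is a placeholder, and tokenization and the llama_decode() call itself are omitted) of a context configuration that reaches this branch:

// Minimal sketch: a context configured the way described above.
// Assumption: "model.gguf" is a placeholder path.
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == nullptr) {
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings   = true;                    // extract embeddings
    cparams.pooling_type = LLAMA_POOLING_TYPE_NONE; // but no pooling strategy

    // a later llama_decode() on this context takes the
    // "else if (cparams.embeddings)" branch quoted above
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}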

fairydreaming (Collaborator) commented Aug 10, 2024

Hmm, I just added support for pooling type none in llama-embedding, and from what I can see it works just fine. I think the result_embd_pooled tensor is always present when cparams.embeddings is set. It looks like append_pooling() simply renames the result_norm tensor to result_embd_pooled in this case (LLAMA_POOLING_TYPE_NONE).
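
For illustration, a toy sketch of that pass-through (simplified stand-in types, not llama.cpp's actual structures or code): with pooling type none, no reduction is built and the graph callback simply assigns the result_embd_pooled name to the incoming tensor.

// Toy sketch only: simplified stand-ins, not llama.cpp's real types or code.
#include <cstdio>
#include <string>

enum toy_pooling_type { TOY_POOLING_NONE, TOY_POOLING_MEAN /* ... */ };

struct toy_tensor { std::string name; };

// stand-in for the cb() graph callback that assigns tensor names
static void cb(toy_tensor & t, const char * name) { t.name = name; }

static toy_tensor append_pooling_sketch(toy_tensor inp, toy_pooling_type pooling) {
    toy_tensor cur = inp; // for "none", the pooled output is the input itself
    if (pooling != TOY_POOLING_NONE) {
        // the real code builds mean/cls/last reductions here instead
    }
    cb(cur, "result_embd_pooled"); // so the graph always ends up with this name
    return cur;
}

int main() {
    toy_tensor norm{"result_norm"};
    toy_tensor pooled = append_pooling_sketch(norm, TOY_POOLING_NONE);
    std::printf("%s\n", pooled.name.c_str()); // prints: result_embd_pooled
    return 0;
}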

hitzelc (Author) commented Aug 11, 2024

Thank you! I think your change in #8960 came just 15 hours after I posted this.

With your change applied, using llama-embedding

version: 3565 (6e02327)
built with Apple clang version 14.0.0 (clang-1400.0.29.202) for x86_64-apple-darwin21.6.0

I can now run e.g.
llama-embedding -m nomic-embed-text-v1.5.Q8_0.gguf --pooling none -p "abctest"
without the assertion failing.

Interestingly, with a different model, running
llama-embedding -m gemma-2-9b-it-Q6_K_L.gguf --pooling none -p "abctest"
still fails on that same assertion I mentioned:

/tmp/llama.cpp-20240810-5317-gbb4w6/src/llama.cpp:14730: GGML_ASSERT(strcmp(embd->name, "result_embd_pooled") == 0 && "missing embeddings tensor") failed

fairydreaming (Collaborator) commented Aug 11, 2024

@hitzelc Indeed I confirm that it still fails for gemma-2. This model is a bit special, as it has some additional tensor operations between "result_norm" and "result_output" because of logits soft-capping:

        cur = llm_build_norm(ctx0, cur, hparams,
                model.output_norm, NULL,
                LLM_NORM_RMS, cb, -1);
        cb(cur, "result_norm", -1);

        // lm_head
        cur = llm_build_lora_mm(lctx, ctx0, model.output, cur);

        // final logit soft-capping
        cur = ggml_scale(ctx0, cur, 1.0f / hparams.f_final_logit_softcapping);
        cur = ggml_tanh(ctx0, cur);
        cur = ggml_scale(ctx0, cur, hparams.f_final_logit_softcapping);

        cb(cur, "result_output", -1);

That causes the code above, which expects to find result_norm (renamed to result_embd_pooled) at graph node position gf->n_nodes - 2, to fail.

This little patch should fix the problem (it searches the whole graph for result_embd_pooled instead of checking only the last two nodes):

diff --git a/src/llama.cpp b/src/llama.cpp
index e0fe8013..aaf8db49 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -14722,12 +14722,15 @@ static int llama_decode_internal(
             res  = nullptr;
             embd = nullptr;
         } else if (cparams.embeddings) {
-            res = nullptr; // do not extract logits for embedding case
-            embd = gf->nodes[gf->n_nodes - 1];
-            if (strcmp(embd->name, "result_embd_pooled") != 0) {
-                embd = gf->nodes[gf->n_nodes - 2];
+            res  = nullptr; // do not extract logits for embedding case
+            embd = nullptr;
+            for (int i = gf->n_nodes - 1; i >= 0; --i) {
+                if (strcmp(gf->nodes[i]->name, "result_embd_pooled") == 0) {
+                    embd = gf->nodes[i];
+                    break;
+                }
             }
-            GGML_ASSERT(strcmp(embd->name, "result_embd_pooled") == 0 && "missing embeddings tensor");
+            GGML_ASSERT(embd != nullptr && "missing embeddings tensor");
         } else {
             embd = nullptr; // do not extract embeddings when not needed
             GGML_ASSERT(strcmp(res->name, "result_output") == 0 && "missing result_output tensor");
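
As a standalone illustration of the search the patch performs (a toy example, not part of the patch; it uses the ggml API from that period, when the ggml_cgraph fields were still public in ggml.h):

// Toy example: build a tiny graph, name an intermediate node, then search the
// graph backwards by name, as the patch above does for result_embd_pooled.
#include <stdio.h>
#include <string.h>
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // a -> scale -> tanh; the named node is followed by a trailing op,
    // similar to gemma-2's soft-capping ops after result_norm
    struct ggml_tensor * a   = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor * cur = ggml_scale(ctx, a, 0.5f);
    ggml_set_name(cur, "result_embd_pooled");
    cur = ggml_tanh(ctx, cur);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, cur);

    // search from the back for the node with the wanted name
    struct ggml_tensor * embd = NULL;
    for (int i = gf->n_nodes - 1; i >= 0; --i) {
        if (strcmp(gf->nodes[i]->name, "result_embd_pooled") == 0) {
            embd = gf->nodes[i];
            break;
        }
    }
    printf("found: %s\n", embd ? embd->name : "(none)");

    ggml_free(ctx);
    return 0;
}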

fairydreaming added the bug and low severity labels Aug 11, 2024
hitzelc (Author) commented Aug 12, 2024

(quoting fairydreaming's comment above in full)

Nice, working for me as well, and I very much appreciate your explaining the difference between models. Thank you kindly.
