Bug: Unintended behavior in llama_decode_internal when cparams.embeddings is True and cparams.pooling_type is LLAMA_POOLING_TYPE_NONE #8956

Closed
hitzelc opened this issue Aug 9, 2024 · 4 comments · Fixed by #8981
Labels
bug: Something isn't working
low severity: Used to report low severity bugs in llama.cpp (e.g. cosmetic issues, non critical UI glitches)

Comments

hitzelc commented Aug 9, 2024

In llama_decode_internal, if cparams.embeddings is true but cparams.pooling_type is LLAMA_POOLING_TYPE_NONE (i.e. 0), the following always results in a failed assertion:

llama.cpp/src/llama.cpp, lines 14619 to 14625 in 6afd1a9:

} else if (cparams.embeddings) {
res = nullptr; // do not extract logits for embedding case
embd = gf->nodes[gf->n_nodes - 1];
if (strcmp(embd->name, "result_embd_pooled") != 0) {
embd = gf->nodes[gf->n_nodes - 2];
}
GGML_ASSERT(strcmp(embd->name, "result_embd_pooled") == 0 && "missing embeddings tensor");

I noticed this bool:
const bool embd_pooled = cparams.embeddings && cparams.pooling_type != LLAMA_POOLING_TYPE_NONE;
is set just a few lines above, so I'm wondering whether there is some intended behavior for the case where no pooling strategy is selected.

Or should there simply be a warning here, or a default pooling strategy forced?
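
For reference, here is a minimal sketch (not from the original report; the model path is a placeholder, and tokenization and the llama_decode() call itself are omitted) of a context configuration that reaches this branch:

// Minimal sketch: a context configured the way described above.
// Assumption: "model.gguf" is a placeholder path.
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == nullptr) {
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings   = true;                    // extract embeddings
    cparams.pooling_type = LLAMA_POOLING_TYPE_NONE; // but no pooling strategy

    // a later llama_decode() on this context takes the
    // "else if (cparams.embeddings)" branch quoted above
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}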

fairydreaming (Collaborator) commented Aug 10, 2024

Hmm, I just added support for pooling type none in llama-embedding, and from what I can see it works just fine. I think the result_embd_pooled tensor is always present when cparams.embeddings is set. It looks like append_pooling() simply renames the result_norm tensor to result_embd_pooled in this case (LLAMA_POOLING_TYPE_NONE).
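
For illustration, a toy sketch of that pass-through (simplified stand-in types, not llama.cpp's actual structures or code): with pooling type none, no reduction is built and the graph callback simply assigns the result_embd_pooled name to the incoming tensor.

// Toy sketch only: simplified stand-ins, not llama.cpp's real types or code.
#include <cstdio>
#include <string>

enum toy_pooling_type { TOY_POOLING_NONE, TOY_POOLING_MEAN /* ... */ };

struct toy_tensor { std::string name; };

// stand-in for the cb() graph callback that assigns tensor names
static void cb(toy_tensor & t, const char * name) { t.name = name; }

static toy_tensor append_pooling_sketch(toy_tensor inp, toy_pooling_type pooling) {
    toy_tensor cur = inp; // for "none", the pooled output is the input itself
    if (pooling != TOY_POOLING_NONE) {
        // the real code builds mean/cls/last reductions here instead
    }
    cb(cur, "result_embd_pooled"); // so the graph always ends up with this name
    return cur;
}

int main() {
    toy_tensor norm{"result_norm"};
    toy_tensor pooled = append_pooling_sketch(norm, TOY_POOLING_NONE);
    std::printf("%s\n", pooled.name.c_str()); // prints: result_embd_pooled
    return 0;
}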

hitzelc (Author) commented Aug 11, 2024

Thank you! I think your change in #8960 came just 15 hours after I posted this.

With your change applied, using llama-embedding

version: 3565 (6e02327)
built with Apple clang version 14.0.0 (clang-1400.0.29.202) for x86_64-apple-darwin21.6.0

I can now run e.g.
llama-embedding -m nomic-embed-text-v1.5.Q8_0.gguf --pooling none -p "abctest"
without the assertion failing.

Interestingly, with a different model, running
llama-embedding -m gemma-2-9b-it-Q6_K_L.gguf --pooling none -p "abctest"
still fails on that same assertion I mentioned:

/tmp/llama.cpp-20240810-5317-gbb4w6/src/llama.cpp:14730: GGML_ASSERT(strcmp(embd->name, "result_embd_pooled") == 0 && "missing embeddings tensor") failed

fairydreaming (Collaborator) commented Aug 11, 2024

@hitzelc Indeed I confirm that it still fails for gemma-2. This model is a bit special, as it has some additional tensor operations between "result_norm" and "result_output" because of logits soft-capping:

        cur = llm_build_norm(ctx0, cur, hparams,
                model.output_norm, NULL,
                LLM_NORM_RMS, cb, -1);
        cb(cur, "result_norm", -1);

        // lm_head
        cur = llm_build_lora_mm(lctx, ctx0, model.output, cur);

        // final logit soft-capping
        cur = ggml_scale(ctx0, cur, 1.0f / hparams.f_final_logit_softcapping);
        cur = ggml_tanh(ctx0, cur);
        cur = ggml_scale(ctx0, cur, hparams.f_final_logit_softcapping);

        cb(cur, "result_output", -1);

That causes the code above, which expects to find result_norm (renamed to result_embd_pooled) at graph node position gf->n_nodes - 2, to fail.

This little patch should fix the problem (it searches the whole graph for result_embd_pooled instead of checking only the last two nodes):

diff --git a/src/llama.cpp b/src/llama.cpp
index e0fe8013..aaf8db49 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -14722,12 +14722,15 @@ static int llama_decode_internal(
             res  = nullptr;
             embd = nullptr;
         } else if (cparams.embeddings) {
-            res = nullptr; // do not extract logits for embedding case
-            embd = gf->nodes[gf->n_nodes - 1];
-            if (strcmp(embd->name, "result_embd_pooled") != 0) {
-                embd = gf->nodes[gf->n_nodes - 2];
+            res  = nullptr; // do not extract logits for embedding case
+            embd = nullptr;
+            for (int i = gf->n_nodes - 1; i >= 0; --i) {
+                if (strcmp(gf->nodes[i]->name, "result_embd_pooled") == 0) {
+                    embd = gf->nodes[i];
+                    break;
+                }
             }
-            GGML_ASSERT(strcmp(embd->name, "result_embd_pooled") == 0 && "missing embeddings tensor");
+            GGML_ASSERT(embd != nullptr && "missing embeddings tensor");
         } else {
             embd = nullptr; // do not extract embeddings when not needed
             GGML_ASSERT(strcmp(res->name, "result_output") == 0 && "missing result_output tensor");
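
As a standalone illustration of the search the patch performs (a toy example, not part of the patch; it uses the ggml API from that period, when the ggml_cgraph fields were still public in ggml.h):

// Toy example: build a tiny graph, name an intermediate node, then search the
// graph backwards by name, as the patch above does for result_embd_pooled.
#include <stdio.h>
#include <string.h>
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // a -> scale -> tanh; the named node is followed by a trailing op,
    // similar to gemma-2's soft-capping ops after result_norm
    struct ggml_tensor * a   = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor * cur = ggml_scale(ctx, a, 0.5f);
    ggml_set_name(cur, "result_embd_pooled");
    cur = ggml_tanh(ctx, cur);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, cur);

    // search from the back for the node with the wanted name
    struct ggml_tensor * embd = NULL;
    for (int i = gf->n_nodes - 1; i >= 0; --i) {
        if (strcmp(gf->nodes[i]->name, "result_embd_pooled") == 0) {
            embd = gf->nodes[i];
            break;
        }
    }
    printf("found: %s\n", embd ? embd->name : "(none)");

    ggml_free(ctx);
    return 0;
}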

fairydreaming added the bug and low severity labels Aug 11, 2024
hitzelc (Author) commented Aug 12, 2024

(quoting fairydreaming's comment above in full)

Nice, working for me as well, and I very much appreciate your explaining the difference between models. Thank you kindly.
