Background Description

Ref: #7553, required for supporting future vision models (#8010)
I initially planned to make a proposal PR for this, but it turns out to be quite a bit more complicated than I thought. Therefore, I'm creating this issue for further discussion before actually implementing it.
Possible Refactor Approaches
The problem can be divided into 2 parts:
- How can the llama_batch be constructed?
- How should the cgraph be modified?
For the second part (how the cgraph should be modified), it should be simple: llm_build_inp_embd can be modified to concat tensors from the "learned" embd and the input embd (a rough sketch of the idea follows below). The attention mask also needs to be updated accordingly.
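A minimal sketch of the idea, assuming ggml-style graph building. The helper name build_inp_embd_mixed, the tensor names, and the exact ggml_concat signature are assumptions for illustration; this is not the actual llm_build_inp_embd code.

```c
#include "ggml.h"

// Sketch: embed the text tokens via the model's token embedding matrix,
// pass externally supplied embeddings through as-is, and concatenate the
// two along the token dimension before the first transformer block.
static struct ggml_tensor * build_inp_embd_mixed(
        struct ggml_context * ctx,
        struct ggml_tensor  * tok_embd,    // [n_embd, n_vocab] learned token embeddings
        struct ggml_tensor  * inp_tokens,  // [n_text_tokens]   token IDs of the text part
        struct ggml_tensor  * inp_embd) {  // [n_embd, n_extra] externally provided embeddings
    // "learned" embd for the text tokens
    struct ggml_tensor * cur = ggml_get_rows(ctx, tok_embd, inp_tokens);

    // append the external embeddings after the text tokens
    // (dim 1 is the token dimension; ggml_concat's signature differs across ggml versions)
    cur = ggml_concat(ctx, cur, inp_embd, 1);

    // note: the KQ mask and positions must also cover the extra embedding tokens
    return cur;
}
```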
For the first part (how the llama_batch can be constructed), the problem is that there are many different possible approaches:

Proposal 1: Add n_embds to llama_batch

The downside of this approach is that it's quite messy to keep track of n_seq_id, seq_id and logits.

Proposal 2: Add an overloaded version of llama_decode/encode

The downside would be that this is kind of hacky (not intuitive for developers), because one batch is now represented by 2 llama_batch objects.

Proposal 3: Keep llama_batch the same, but token IDs < 0 are embeddings

This seems to be easier to implement than all the other proposals; one possible encoding is sketched below. The only thing I'm not sure about is whether we expect negative token IDs to be a reserved use case.
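To make Proposal 3 more concrete, here is one hypothetical encoding (not current llama.cpp behavior, and not necessarily the encoding the proposal had in mind): a token ID of -(i+1) in llama_batch.token means "use row i of the externally supplied embd buffer instead of a learned embedding".

```c
#include <stdbool.h>
#include "llama.h" // for llama_token (int32_t)

// Hypothetical encoding: IDs >= 0 are normal vocab tokens,
// ID -(i+1) refers to row i of an external embeddings buffer.
static inline bool token_is_embd(llama_token id) {
    return id < 0;
}

static inline int token_embd_row(llama_token id) {
    return -(id + 1); // -1 -> row 0, -2 -> row 1, ...
}

// Example batch contents: "tok tok [embd row 0] [embd row 1] tok"
//   llama_token tokens[5] = { 15043, 3186, -1, -2, 29889 };
//   batch.token = tokens; batch.embd = img_embd; batch.n_tokens = 5;
```

With a scheme like this, the existing per-token metadata (pos, seq_id, logits) keeps working unchanged, which is presumably why this proposal is the easiest to implement.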
Proposal 4: Completely refactor llama_batch to accept a sequence list instead of a token list
This was actually proposed by @slaren, but I'm not sure what it would look like in the real world. Could you please explain it further?
I'm also tagging @ggerganov and @abetlen for further discussion. Thank you!
> This was actually proposed by @slaren, but I'm not sure what it would look like in the real world. Could you please explain it further?
struct llama_batch_seq {
    int           n_tokens;
    llama_token * token;   // only one of token and embd can be non-null
    float       * embd;
    enum { none, all, last } logits;
    int           seq_id;
    int           pos;     // -1 = end
};

struct llama_batch {
    int               n_seqs;
    llama_batch_seq * seqs;
};
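To get a feel for what this API shape would mean for callers, here is a hypothetical usage sketch. None of this exists in llama.cpp today: the structs are repeated from the proposal above so the snippet is self-contained, the token IDs and n_embd = 4096 are arbitrary, and llama_decode would still need to be adapted to accept such a batch.

```c
#include <stdint.h>
#include <stddef.h>

typedef int32_t llama_token; // matches the typedef in llama.h

// Proposed (not yet existing) sequence-based batch API from the comment above,
// repeated here only so that this sketch is self-contained.
struct llama_batch_seq {
    int           n_tokens;
    llama_token * token;   // only one of token and embd can be non-null
    float       * embd;
    enum { none, all, last } logits;
    int           seq_id;
    int           pos;     // -1 = end
};

struct llama_batch {
    int                      n_seqs;
    struct llama_batch_seq * seqs;
};

int main(void) {
    // sequence 0: plain text tokens, logits only for the last one (IDs are arbitrary examples)
    static llama_token text_tokens[3] = { 15043, 3186, 29889 };
    // sequence 1: 16 "tokens" of precomputed (e.g. vision) embeddings, assuming n_embd = 4096
    static float img_embd[16 * 4096];

    struct llama_batch_seq seqs[2] = {
        { 3,  text_tokens, NULL,     last, /*seq_id*/ 0, /*pos*/ -1 },
        { 16, NULL,        img_embd, none, /*seq_id*/ 1, /*pos*/ -1 },
    };

    struct llama_batch batch = { 2, seqs };
    // llama_decode(ctx, batch);  // would require llama_decode to accept this new layout
    (void) batch;
    return 0;
}
```

Compared to the current token-level llama_batch, this shape states seq_id, pos and logits once per sequence instead of once per token.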