
Refactor: Allow adding both tokens and embeddings to llama_batch #10381

ngxson opened this issue Nov 18, 2024 · 1 comment
ngxson commented Nov 18, 2024

Background Description

Ref: #7553, required for supporting future vision models (#8010)

I initially planned to make a proposal PR for this, but it turns out to be more complicated than I thought. Therefore, I'm creating this issue for further discussion before actually implementing it.

Possible Refactor Approaches

The problem can be divided into 2 parts:

  1. How should the llama_batch be constructed?
  2. How should the cgraph be modified?

For the second part (how the cgraph should be modified), it should be simple: llm_build_inp_embd can be modified to concat tensors from the "learned" embd and the input embd. The pseudo-code looks like this:

// pseudo-code: build the model input by combining rows looked up from the
// learned token embedding table with externally provided embeddings
lctx.inp_tokens = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, batch.n_tokens);
struct ggml_tensor * learned_embd = ggml_get_rows(ctx, tok_embd, lctx.inp_tokens);
struct ggml_tensor * inp_embd     = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, batch.n_tokens);
inpL = ggml_concat(ctx, learned_embd, inp_embd);

The attention mask also needs to be updated accordingly.
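On the host side, setting these inputs could look roughly like the sketch below (assuming the two input tensors above are stored on the context, similar to how lctx.inp_tokens / lctx.inp_embd work today; n_embd_slots is a made-up name for the number of embedding entries in the batch):

// rough sketch only: copy the batch data into the graph input tensors before
// evaluation, similar to what the existing input-setting code does for tokens
ggml_backend_tensor_set(lctx.inp_tokens, batch.token, 0,
        batch.n_tokens * ggml_element_size(lctx.inp_tokens));
// n_embd_slots = number of embedding entries in the batch (hypothetical)
ggml_backend_tensor_set(lctx.inp_embd, batch.embd, 0,
        (size_t) n_embd * n_embd_slots * ggml_element_size(lctx.inp_embd));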


For the first part (how the llama_batch should be constructed), the problem is that there are many different possible approaches:

Proposal 1: Add n_embds to llama_batch

    typedef struct llama_batch {
        int32_t n_tokens;
        int32_t n_embds;

        llama_token  *  token;    // has n_tokens elements
        float        *  embd;     // has n_embds * dim_embd elements
        int32_t      *  n_seq_id; // has n_tokens + n_embds elements
        ...
    } llama_batch;

The downside of this approach is that it becomes quite messy to keep track of n_seq_id, seq_id and logits across the two counts. For illustration, a hypothetical caller might fill the batch as shown below (names like text_tokens and image_embd are made up for the example):
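// hypothetical usage of proposal 1: 2 text tokens followed by 1 embedding slot
batch.n_tokens = 2;
batch.n_embds  = 1;
batch.token    = text_tokens;  // 2 token IDs
batch.embd     = image_embd;   // 1 * dim_embd floats
// the metadata arrays are indexed over n_tokens + n_embds slots:
batch.n_seq_id[0] = 1;  // slot 0 -> text_tokens[0]
batch.n_seq_id[1] = 1;  // slot 1 -> text_tokens[1]
batch.n_seq_id[2] = 1;  // slot 2 -> the single embedding
batch.logits[2]   = true;  // request logits only for the last slot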

Proposal 2: Add an overload version of llama_decode/encode

llama_decode(struct llama_context * ctx, struct llama_batch * batch_tokens, struct llama_batch * batch_embd);

The downside is that this is kind of hacky (not intuitive for developers), because one logical batch is now represented by two llama_batch objects. A call site would look something like the sketch below; how positions and sequence IDs are merged across the two objects is exactly the unintuitive part.
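// hypothetical usage of proposal 2 (sketch only): one logical batch is split
// into a token-only batch and an embedding-only batch
struct llama_batch batch_tokens = /* text tokens for this step              */;
struct llama_batch batch_embd   = /* embeddings, e.g. from a vision encoder */;
llama_decode(ctx, &batch_tokens, &batch_embd);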

Proposal 3: Keep llama_batch the same, but tokens ID < 0 are embeddings

For example:

batch.token = { 1, 842, -1, -1, 242, -1 };
batch.embd = {
  -0.4, 0.2, 0.12,..., // corresponds to batch.token[2]
  0.04, 0.02, 0.3,..., // corresponds to batch.token[3]
  0.04, 0.1, -0.3,..., // corresponds to batch.token[5]
};

This seems to be easier to implement than all the other proposals; the consuming side could be as simple as the sketch below (hypothetical code, not an existing llama.cpp function). The only thing I'm not sure about is whether negative token IDs are expected to be reserved for some other use case.
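// rough sketch of how a decoder could consume proposal 3: negative token IDs
// mark slots whose data comes from batch.embd instead of the embedding table
int i_embd = 0;
for (int32_t i = 0; i < batch.n_tokens; i++) {
    if (batch.token[i] < 0) {
        // take the next row of batch.embd for this slot
        const float * row = batch.embd + (size_t) i_embd * n_embd;
        // ... copy `row` into the input embedding tensor at position i ...
        i_embd++;
    } else {
        // ... look up the learned embedding for batch.token[i] as usual ...
    }
}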

Proposal 4: Completely refactor llama_batch to accept sequence list instead of token list

This is actually proposed by @slaren, but I'm not sure what it should look like in the real world. Could you please explain it further?

I'm also tagging @ggerganov and @abetlen for further discussion. Thank you!

slaren commented Nov 18, 2024

This is actually proposed by @slaren, but I'm not sure what it should look like in the real world. Could you please explain it further?

struct llama_batch_seq {
    int n_tokens;
    llama_token * token; // only one of token and embd can be non-null
    float * embd;
    enum { none, all, last } logits;
    int seq_id;
    int pos; // -1 = end
};

struct llama_batch {
    int n_seqs;
    llama_batch_seq * seqs;
};
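For example, a caller would build one llama_batch_seq per sequence and could mix token and embedding sequences freely. A sketch only (names like img_embd are just for illustration):

// hypothetical usage of the sequence-based batch: one text sequence plus one
// embedding-only sequence (e.g. image patches from a vision encoder)
struct llama_batch_seq seqs[2];

seqs[0].n_tokens = n_text_tokens;
seqs[0].token    = text_tokens;   // token path: embd stays NULL
seqs[0].embd     = NULL;
seqs[0].logits   = last;          // only need logits for the last token
seqs[0].seq_id   = 0;
seqs[0].pos      = -1;            // -1 = append at the end of the sequence

seqs[1].n_tokens = n_img_embd;
seqs[1].token    = NULL;          // embedding path: token stays NULL
seqs[1].embd     = img_embd;      // n_img_embd * n_embd floats
seqs[1].logits   = none;
seqs[1].seq_id   = 1;
seqs[1].pos      = -1;

struct llama_batch batch = { /*n_seqs =*/ 2, seqs };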
