Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixes #7999 (adds control vectors to all build_XXX() functions in llama.cpp [needs testing] #8060

Merged
merged 8 commits into from
Jun 25, 2024
Merged

fixes #7999 (adds control vectors to all build_XXX() functions in llama.cpp [needs testing] #8060

merged 8 commits into from
Jun 25, 2024

Conversation

jukofyork
Copy link
Contributor

@jukofyork jukofyork commented Jun 22, 2024

The `build_command_r` forgot to add the control vector.
@mofosyne mofosyne added the Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix label Jun 22, 2024
@mofosyne
Copy link
Collaborator

@jukofyork what's your thought about this other PR #8052 that she made. Is this related to this fix?

@jukofyork jukofyork changed the title fixes #7999 (the build_command_r function forgot to add the control vector) fixes #7999 (adds control vectors to build_command_r() and build_qwen2()) Jun 22, 2024
@jukofyork
Copy link
Contributor Author

I can't actually test the build_qwen2() change as don't have a copy downloaded, but have tested the build_command_r() change and can confirm it's working (doing something anyway). I can't see anything about build_qwen2() that would make it not work, but maybe best if somebody can test it before pushing to be sure...

@jukofyork
Copy link
Contributor Author

jukofyork commented Jun 22, 2024

@jukofyork what's your thought about this other PR #8052 that she made. Is this related to this fix?

I'm using my own code to create the control vectors so not really sure how the new cvector-generator.cpp works enough to be sure...

But the fix looks like a good idea as adding a space there can really screw up some models due to the "token bias problem" which causes the models to have to create the first word using rare/out-of-distribution tokens instead of the tokens it has that start with a space.

This was reported yesterday on Reddit as causing the new deepseek-coder-2 models to output Chinese because the Ollama template had an extra space like this, so pretty sure the "token bias problem" is still a thing and "token healing" (IIRC the correct term) is not implemented in llama.cpp yet.

EDIT: My reply assumes their PR is taking about this space: "[INST] Act as if you're extremely happy. [/INST]<HERE>" - I've not read the rest of the code to know what persona and suffix actually are...

@jukofyork
Copy link
Contributor Author

jukofyork commented Jun 22, 2024

Actually it looks like most of the build_XXX() functions are missing the ability to add control vectors in llama.cpp?

I can add them all in but don't have the models to test against, so not sure this is a good idea? The only ones I can see that have the code in are:

  • build_llama()
  • build_grok()
  • build_dbrx()
  • build_olmo()
  • build_arctic()

I just added these 2 in this PR:

  • build_command_r()
  • build_qwen2()

but there are dozens of other build_XXX() functions too... Many for models I've never even heard of!

@jukofyork
Copy link
Contributor Author

Plus there's gonna be loads of duplicated code... It would be beter IMO to implement a helper function like the existing llm_build_norm(), llm_build_ffn(), etc and then use this to apply the control vectors instead of copying this snippet into every build_XXX() function.

@jukofyork
Copy link
Contributor Author

jukofyork commented Jun 22, 2024

@ggerganov @slaren

Can you guys have a look at this? I'm happy to fix them all but not sure what is the best way to test any of it without the models...

@jukofyork
Copy link
Contributor Author

jukofyork commented Jun 22, 2024

I also see:

CI / macOS-latest-cmake-arm64 (pull_request) Failing after 2m

which may be due to this CPU being little-endian? (I've no idea about macs...)

If so, and with the fact so many other build_XXX() functions are also missing their control-vectors code; I think this should be upgraded to medium complexity rather than low (changing the checkbox in the OP didn't do this).

@mofosyne mofosyne added Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level and removed Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix labels Jun 22, 2024
@jukofyork
Copy link
Contributor Author

Not related to the above, but still relevant to control vector code:

llama_control_vector_data llama_control_vector_load(const std::vector<llama_control_vector_load_info> & load_infos) {
    llama_control_vector_data result = { -1, {} };

    for (const auto & info : load_infos) {
        auto cur = llama_control_vector_load_one(info);

        if (cur.n_embd == -1) {
            return result;
        }
        if (result.n_embd != -1 && (result.n_embd != cur.n_embd || result.data.size() != cur.data.size())) {
            fprintf(stderr, "%s: control vector in %s does not match previous vector dimensions\n", __func__, info.fname.c_str());
            return result;
        }

        if (result.n_embd == -1) {
            result = std::move(cur);
        } else {
            for (size_t i = 0; i < cur.data.size(); i++) {
                result.data[i] += cur.data[i];
            }
        }
    }

    if (result.n_embd == -1) {
        fprintf(stderr, "%s: no vectors passed\n", __func__);
    }

    return result;
}

This loop really should take the union of control vectors and only sum those that have matching indices. As it is, the result.data.size() != cur.data.size() test stops you doing this and the only workaround is to pad the exported control vectors with zeroed-valued vectors for all layers you don't care about (not ideal).

I can't see an easy way to adapt llama_control_vector_load_one() to be able to do this union due to the for (uint32_t il = 1; il <= max_direction_layer; il++) loop at the end, etc. Perhaps a new function to take the union and sum would be best?

@jukofyork
Copy link
Contributor Author

jukofyork commented Jun 22, 2024

A few more thing related to control vectors worth mentioning:


In the Python code: https://github.com/vgel/repeng/blob/main/repeng/extract.py

    @classmethod
    def import_gguf(cls, path: os.PathLike[str] | str) -> "ControlVector":
        reader = gguf.GGUFReader(path)

        archf = reader.get_field("general.architecture")
        if not archf or not len(archf.parts):
            warnings.warn(".gguf file missing architecture field")
        else:
            arch = str(bytes(archf.parts[-1]), encoding="utf-8", errors="replace")
            if arch != "controlvector":
                warnings.warn(
                    f".gguf file with architecture {arch!r} does not appear to be a control vector!"
                )

        modelf = reader.get_field("controlvector.model_hint")
        if not modelf or not len(modelf.parts):
            raise ValueError(".gguf file missing controlvector.model_hint field")
        model_hint = str(bytes(modelf.parts[-1]), encoding="utf-8")

        directions = {}
        for tensor in reader.tensors:
            if not tensor.name.startswith("direction."):
                continue
            try:
                layer = int(tensor.name.split(".")[1])
            except:
                raise ValueError(
                    f".gguf file has invalid direction field name: {tensor.name}"
                )
            directions[layer] = tensor.data

        return cls(model_type=model_hint, directions=directions)

He sets .layer_count to the number of vectors rather than the number of layers in the model? I see this never actaully gets used in the llama.cpp code, but I assume it should be the model layer count and is intended to be used (eventually) as a sanity check that the .model_hint and .layer_count match the model, as the number of control vectors can already be deduced from the internal ggml tensor dimensions?


I'm also not sure if the Python code has an off by 1 error and the layers are all 1 lower than they should be:

    # normalize the layer indexes if they're negative
    n_layers = len(model_layer_list(model))
    hidden_layers = [i if i >= 0 else n_layers + i for i in hidden_layers]

this makes me think he's setting the floor to zero when the llama.cpp code treats layer 0 as the post-embedding/pre-first-block "layer":

    // Apply a loaded control vector to a llama_context, or if data is NULL, clear
    // the currently loaded vector.
    // n_embd should be the size of a single layer's control, and data should point
    // to an n_embd x n_layers buffer starting from layer 1.
    // il_start and il_end are the layer range the vector should apply to (both inclusive)
    // See llama_control_vector_load in common to load a control vector.
    LLAMA_API int32_t llama_control_vector_apply(
            struct llama_context * lctx,
                     const float * data,
                          size_t   len,
                         int32_t   n_embd,
                         int32_t   il_start,
                         int32_t   il_end);

I'm also fairly sure this:

        if positive_smaller_mean > positive_larger_mean:  # type: ignore
            directions[layer] *= -1

should actually just be:

    directions[layer] *= positive_larger_mean - positive_smaller_mean

(or something similar - his code is very hard to understand).

Basically you want to project the datasets onto the direction and scale by the negative mean difference (or equivalently project the differenced dataset and use its mean).

as the Eigenvectors returned by this

            pca_model = PCA(n_components=1, whiten=False).fit(train)
            # shape (n_features,)
            directions[layer] = pca_model.components_.astype(np.float32).squeeze(axis=0)

all have unit norm (I assume umap.UMAP() is the same?)

and from my own code you can see the dramatically different scales of the projected means (ie: abs(ositive_larger_mean - positive_smaller_mean) above) are:

- Layer 1:
-- #1: 21.6% μ = [0.15, 0.15]
- Layer 64:
-- #1: 8.3% μ = [85.75, 11.13]

The means are 2 orders of magnitude different between the first and last layer! This is due to the way the residual stream overwrites the original embeddings as they layers progress (ie: by writing larger and larger values that eventually dwarf the embeddings in the residual addition operations, etc).

You don't actually often find useful control vectors in the very early layers, but even 1/3 of the way through they are 1 order of magnitude lower:

- Layer 20:
-- #1: 9.0% μ = [2.15, 2.24]

and this is going to mean that using a single scale-factor it will be a case of reducing it hugely to not screw the model in the later layers whilst simultaneously not scaling the early-mid layers enough...

I'm not sure if the new llama.cpp native code has copied this sign-only change, but if so it should be looked at.

@HatsuneMikuUwU33
Copy link
Contributor

HatsuneMikuUwU33 commented Jun 22, 2024

I can add them all in but don't have the models to test against, so not sure this is a good idea?

Just add them all in, we will hear later if someone has problems with them.

@jukofyork
Copy link
Contributor Author

I can add them all in but don't have the models to test against, so not sure this is a good idea?

Just add them all in, we will hear later if someone has problems with them.

I'll have a look later - it would be nice to know what the arm64 error is though? Why would that fail when presumably the exact same code snippet worked in the other models?

@jukofyork
Copy link
Contributor Author

features/results.feature:29 consistent results with same seed -- @1.1

That's an odd error to get too.

@slaren
Copy link
Collaborator

slaren commented Jun 22, 2024

Plus there's gonna be loads of duplicated code... It would be beter IMO to implement a helper function like the existing llm_build_norm(), llm_build_ffn(), etc and then use this to apply the control vectors instead of copying this snippet into every build_XXX() function.

It could probably be done in the cb function if cur is changed to take a reference so that it can be modified in the callback, and the layer output tensor is given a consistent name across all the models (it seems a bit of a mess at the moment).

I'll have a look later - it would be nice to know what the arm64 error is though? Why would that fail when presumably the exact same code snippet worked in the other models?

features/results.feature:29 consistent results with same seed -- @1.1

These are known intermittent CI failures and are not related to this PR.

@jukofyork
Copy link
Contributor Author

jukofyork commented Jun 22, 2024

Plus there's gonna be loads of duplicated code... It would be beter IMO to implement a helper function like the existing llm_build_norm(), llm_build_ffn(), etc and then use this to apply the control vectors instead of copying this snippet into every build_XXX() function.

It could probably be done in the cb function if cur is changed to take a reference so that it can be modified in the callback, and the layer output tensor is given a consistent name across all the models (it seems a bit of a mess at the moment).

I'll have a look later - it would be nice to know what the arm64 error is though? Why would that fail when presumably the exact same code snippet worked in the other models?

features/results.feature:29 consistent results with same seed -- @1.1

These are known intermittent CI failures and are not related to this PR.

I've just done this for every one:

                if (ggml_tensor * layer_dir = lctx.cvec.tensor_for(il)) {
                    cur = ggml_add(ctx0, cur, layer_dir);
                }
                cb(cur, "l_out", il);

                // input for next layer
                inpL = cur;

as it was easy to search for "l_out", il) and find all the occurrences that way. I also changed any that were doing inpL = ggml_add(... to cur = ggml_add( ... // input for next layer so all consistent now.

It's just compiling now.

@jukofyork
Copy link
Contributor Author

jukofyork commented Jun 22, 2024

I also saw there were 2-3 places where cb(cur, "l_out", il) was called twice between two tensor additions, but I wasn't 100% sure if there was a good reason for this and left them alone... I assume this function stands for "callback" and is what gets used for the imatrix code, etc? If so I don't think these 2-3 double calls are likely correct.

@slaren
Copy link
Collaborator

slaren commented Jun 22, 2024

It stands for "callback", but it is a different callback than imatrix and other tools uses. It was used for more things in the past, nowadays it only used to set the name of the tensors and adjust the backends used in some ops.

llama.cpp/llama.cpp

Lines 11772 to 11777 in 3e58b0e

llm_build_cb cb = [&](struct ggml_tensor * cur, const char * name, int il) {
if (il >= 0) {
ggml_format_name(cur, "%s-%d", name, il);
} else {
ggml_set_name(cur, name);
}

@jukofyork
Copy link
Contributor Author

These are known intermittent CI failures and are not related to this PR.

Oh, I thought it might have been related to this:

gguf: This GGUF file is for Little Endian only

Has there been any thought about adding a flag and then casting in/out of Big-Endian and allowing these to work? It's been a long time since I dealt with anything other than x86, but back in the days of the M68000 it was quite common to do this to allow sharing data between Z80, 8080, etc.

@jukofyork
Copy link
Contributor Author

It's compiled and I can confirm that control-r-plus is working again, but I wouldn't be in a rush to push this as I've literally never even heard of half the models.

I've tripled checked the diff and I can't see anything I've screwed up, but it was a lot of changes to make....

@slaren
Copy link
Collaborator

slaren commented Jun 22, 2024

At some point someone added support for big ending that consisted in bumping the gguf version (?). AFAIK there are no significant big endian machines these days and in my opinion it is not worth spending any effort on that.

@jukofyork
Copy link
Contributor Author

It stands for "callback", but it is a different callback than imatrix and other tools uses. It was used for more things in the past, nowadays it only used to set the name of the tensors and adjust the backends used in some ops.

llama.cpp/llama.cpp

Lines 11772 to 11777 in 3e58b0e

llm_build_cb cb = [&](struct ggml_tensor * cur, const char * name, int il) {
if (il >= 0) {
ggml_format_name(cur, "%s-%d", name, il);
} else {
ggml_set_name(cur, name);
}

@slaren Can you check the double calls to cb(cur, "l_out", il) and see if they should be there? It's probably a good time to fix them if they aren't... Just searching this PR's code for ""l_out", il)" should show them quickly.

@jukofyork
Copy link
Contributor Author

At some point someone added support for big ending that consisted in bumping the gguf version (?). AFAIK there are no significant big endian machines these days and in my opinion it is not worth spending any effort on that.

Yeah, I've never come across a Big-Endian CPU since the the M68000 days, but thought it was possible some of the Mac's CPUs
still used them due to that test failing.

@slaren
Copy link
Collaborator

slaren commented Jun 22, 2024

l_out is just a name, it's only relevance is to help identify the operation during debugging AND also in the control vector generation example. So basically at this point, the name l_out is only really relevant for control vectors, so it would be ok to normalize the usage of this name in all the models in whatever way suits the control vector feature.

@jukofyork
Copy link
Contributor Author

l_out is just a name, it's only relevance is to help identify the operation during debugging AND also in the control vector generation example. So basically at this point, the name l_out is only really relevant for control vectors, so it would be ok to normalize the usage of this name in all the models in whatever way suits the control vector feature.

I removed the double calls: f393795

@jukofyork jukofyork changed the title fixes #7999 (adds control vectors to build_command_r() and build_qwen2()) fixes #7999 (adds control vectors to all build_XXX() functions in llama.cpp [needs testing] Jun 22, 2024
@jukofyork
Copy link
Contributor Author

jukofyork commented Jun 25, 2024

I might have a crack at refactoring this properly this week. So instead of just copying this over and over:

if (layer_dir != nullptr) {
    cur = ggml_add(ctx0, cur, layer_dir);
}

I could unify the control vectors, (runtime) orthogonal projection onto a subspace (what people are calling "ablation" or "abliteration"), and the Householder Transform (ie: reflection over a hyperplane to "invert" the models' concepts: it's just - 2*... instead of - 1*... in the formula below).

image

It will only be O(n) if we use (𝐯·𝐯T)·𝐮 = 𝐯·(𝐯T·𝐮) = 𝐯(𝐯T·𝐮) and don't create the outer product, so not really any more expensive than the current code:

@slaren @ggerganov What are the correct ggml functions to use for:

  1. Dot product c = 𝐯T·𝐮.
  2. Multiply by a scaler c𝐯.
  3. Subtract (I could negate and use ggml_add() I guess).

I will then wrap this up in a similar function to llm_build_norm() or llm_build_ffn() that gets called at the bottom for every model right before the cb(cur, "l_out", il) callback (and wrap the nullptr test, etc in this function).

I will also look into refactoring the code that combines (ie: sums) the control vectors so it doesn't need padding with zero vectors any more.

Any comments or suggestions on doing this?

@slaren
Copy link
Collaborator

slaren commented Jun 25, 2024

  • Dot product c = 𝐯T·𝐮.

There isn't a dot product function by itself, but you can use ggml_mul_mat with vectors to achieve the same.

  • Multiply by a scaler c𝐯.

ggml_scale if the scale is constant, ggml_mul otherwise.

  • Subtract (I could negate and use ggml_add() I guess).

There is ggml_sub, but it is not supported in most GPU backends, so ggml_add with ggml_scale can be used for better compatibility.

@jukofyork
Copy link
Contributor Author

  • Dot product c = 𝐯T·𝐮.

There isn't a dot product function by itself, but you can use ggml_mul_mat with vectors to achieve the same.

  • Multiply by a scaler c𝐯.

ggml_scale if the scale is constant, ggml_mul otherwise.

  • Subtract (I could negate and use ggml_add() I guess).

There is ggml_sub, but it is not supported in most GPU backends, so ggml_add with ggml_scale can be used for better compatibility.

Thanks - I'll have a look and try to familiarize myself with the functions first. IIRC, there is some gotcha with the ggml matrices needing transposing but can't see the page that showed the format now :/

@slaren
Copy link
Collaborator

slaren commented Jun 25, 2024

IIRC, there is some gotcha with the ggml matrices needing transposing but can't see the page that showed the format now :/

Here is one explanation: https://github.com/ggerganov/llama.cpp/blob/master/README.md#coding-guidelines

@jukofyork
Copy link
Contributor Author

IIRC, there is some gotcha with the ggml matrices needing transposing but can't see the page that showed the format now :/

Here is one explanation: https://github.com/ggerganov/llama.cpp/blob/master/README.md#coding-guidelines

Yeah, that is it thanks!

@jukofyork
Copy link
Contributor Author

jukofyork commented Jun 25, 2024

So just throwing out some ideas now and would be happy for any input or feedback:

The least bad / least confusing name for this I can think of is "control-projector" and following the --control-vector commands:

         --control-vector FNAME   add a control vector
         --control-vector-scaled FNAME SCALE
                                  add a control vector with user defined scaling SCALE
         --control-vector-layer-range START END
                                  layer range to apply the control vector(s) to, start and end inclusive

I propose these new command line options:

         --control-projector FNAME
         --control-projector-scaled FNAME SCALE
         --control-projector-layer-range START END

The -scale option looks to be inherited from --lora-scaled FNAME S and looks good.

The default value for -control-projector FNAME should probably be 1.0 (or possibly -1.0, but I think 1.0 would be better and less likely to confuse people...) and then coincide with the most likely use of this: orthogonal projection onto the subscape (ie: "collapse", "ablate" , "abliterate", ...), but also let people:

  • Shrink (but not completely collapse) a subspace via setting in the range (0.0, 1.0). This seem intuitive so that "lower value --> shrink less" I think?
  • Reflect by setting to 2.0 (or even partially reflect and scale like the MopeyMule model did by using 1.3).
  • Expand by setting < 0.0 (NOTE: This is not the same as the "induce refusal" method from the paper though!). This may be slightly confusing and counter-intuitive... The alternative is to rejig the formula and have "negative --> reflect" and "> 1.0 --> expand"? Mathematically this seems confusing to me though as "being close to 0" implies "shrink less" doesn't really make me think that "being over 1" implies expand... This needs thinking about carefully I think.

The separate --control-vector-layer-range and --control-projector-layer-range settings are not so good IMO :

  1. There is no good reason to have both and maybe just use --control-layer-range?
  2. Just being able to set a single layer range like this might or might not be quite limiting?

So for (2) we could:

  • Try to add other commands that allow for scaling and/or layer ranges to be specified, but I think this is going to get ugly quickly...
  • Add optional layer ranges, eg: --control-vector-scaled FNAME SCALE <START END>, but again I can't see any other CLI options like this...
  • Create a new program in examples that can be used to manipulate the gguf control vectors in a similar way to the original repeng Python code; including arithmetic operations, etc too.

Any feedback, input or other ideas would be very appreciated! :)

@jukofyork
Copy link
Contributor Author

jukofyork commented Jun 25, 2024

Maybe this is the least confusing:

1 --> Do nothing (ie: multiplicative identity; just as the 'control vectors' use 0 as their additive identity)
0 --> Collapse (ie: multiplicative zero) [default]
-1 --> Reflect

(0, 1) --> Shrink
(1, ∞) --> Expand

(-1, 0) --> Reflect and shrink
(-∞, -1) --> Reflect and expand

Having 0 as the default does likely coincide with peoples' intuitive notion of "shrinking to nothing" better...

@jukofyork
Copy link
Contributor Author

I've made a start by moving the logic intollama_control_vector:

struct llama_control_vector {
    std::vector<struct ggml_tensor *> tensors; // per layer
    std::vector<struct ggml_context *> ctxs;
    std::vector<ggml_backend_buffer_t> bufs;

    int32_t layer_start = -1;
    int32_t layer_end   = -1;

    struct ggml_tensor * tensor_for(int il) const {
        if (il < 0 || il < layer_start || il > layer_end || (size_t) il >= tensors.size()) {
            return nullptr;
        }
        return tensors[il];
    }

    struct ggml_tensor * apply_to(struct ggml_context * ctx, struct ggml_tensor * cur, int  il) const {
        ggml_tensor * layer_dir = tensor_for(il);
        if (layer_dir != nullptr) {
            cur = ggml_add(ctx, cur, layer_dir);
        }
        return cur;
    }

    ~llama_control_vector() {
        for (struct ggml_context * ctx : ctxs) {
            ggml_free(ctx);
        }
        for (ggml_backend_buffer_t buf : bufs) {
            ggml_backend_buffer_free(buf);
        }
    }
};

Which then means each cb(cur, "l_out", il) just needs this:

            cur = lctx.cvec.apply_to(ctx0, cur, il);
            cb(cur, "l_out", il);

There was already a llama_control_vector_apply function called from common.cpp:

    if (!params.control_vectors.empty()) {
        if (params.control_vector_layer_start <= 0) params.control_vector_layer_start = 1;
        if (params.control_vector_layer_end   <= 0) params.control_vector_layer_end   = llama_n_layer(model);

        const auto cvec = llama_control_vector_load(params.control_vectors);
        if (cvec.n_embd == -1) {
            llama_free(lctx);
            llama_free_model(model);
            return std::make_tuple(nullptr, nullptr);
        }

        int err = llama_control_vector_apply(lctx,
                                             cvec.data.data(),
                                             cvec.data.size(),
                                             cvec.n_embd,
                                             params.control_vector_layer_start,
                                             params.control_vector_layer_end);
        if (err) {
            llama_free(lctx);
            llama_free_model(model);
            return std::make_tuple(nullptr, nullptr);
        }
    }

So thought it was best to just add another member to llama_control_vector rather than add to the confusion.

@jukofyork
Copy link
Contributor Author

I think the simplest way forward to add the "control projections" is to duplicate the code for everything related to llama_control_vector as llama_control_projection, and then at the end I should have more idea about the layout and the inner workings and can look to merge the two.

Should I make a new PR for this or keep working in this PR?

@slaren
Copy link
Collaborator

slaren commented Jun 25, 2024

Since this fixes an immediate issue, I think it is better to merge this and continue development of new features in a different PR.

@jukofyork
Copy link
Contributor Author

jukofyork commented Jun 25, 2024

Since this fixes an immediate issue, I think it is better to merge this and continue development of new features in a different PR.

No problem - I'll make a new PR tomorrow to allow taking the union of control vectors as that too fixes an existing problem and then look at adding the new stuff in another PR,

@slaren slaren merged commit 163d50a into ggerganov:master Jun 25, 2024
63 checks passed
@jukofyork jukofyork deleted the jukofyork-command_r-control-vector-fix branch June 25, 2024 21:04
@jukofyork
Copy link
Contributor Author

jukofyork commented Jun 26, 2024

@slaren Sorry to bother you again:

static llama_control_vector_data llama_control_vector_load_one(const llama_control_vector_load_info & load_info) {
    int32_t n_tensors;

    size_t n_bytes = 0;

    uint32_t max_direction_layer = 0;

    llama_control_vector_data result = { -1, {} };

    // calculate size of ctx needed for tensors, ensure tensors are f32, and find max layer
    {
        struct ggml_init_params meta_params = {
            /* .mem_size   = */ ggml_tensor_overhead() * 128 + ggml_graph_overhead(),
            /* .mem_buffer = */ nullptr,
            /* .no_alloc   = */ true,
        };
        ggml_context * meta_ctx = ggml_init(meta_params);
        struct gguf_init_params meta_gguf_params = {
            /* .no_alloc = */ true,
            /* .ctx      = */ &meta_ctx,
        };
        struct gguf_context * meta_ctx_gguf = gguf_init_from_file(load_info.fname.c_str(), meta_gguf_params);
        if (!meta_ctx_gguf) {
            fprintf(stderr, "%s: failed to load control vector from %s\n", __func__, load_info.fname.c_str());
            ggml_free(meta_ctx);
            return result;
        }

        n_tensors = gguf_get_n_tensors(meta_ctx_gguf);
        for (int i = 0; i < n_tensors; i++) {
            std::string name = gguf_get_tensor_name(meta_ctx_gguf, i);

            // split on '.'
            size_t dotpos = name.find('.');
            if (dotpos != std::string::npos && name.substr(0, dotpos) == "direction") {
                try {
                    uint32_t layer = std::stoi(name.substr(dotpos + 1));
                    if (layer == 0) {
                        fprintf(stderr, "%s: direction tensor invalid in %s\n", __func__, load_info.fname.c_str());
                        ggml_free(meta_ctx);
                        gguf_free(meta_ctx_gguf);
                        return result;
                    }
                    if (layer > max_direction_layer) {
                        max_direction_layer = layer;
                    }
                } catch (...) {
                    fprintf(stderr, "%s: direction tensor invalid in %s\n", __func__, load_info.fname.c_str());
                    ggml_free(meta_ctx);
                    gguf_free(meta_ctx_gguf);
                    return result;
                }
            }

            struct ggml_tensor * tensor_meta = ggml_get_tensor(meta_ctx, name.c_str());
            if (tensor_meta->type != GGML_TYPE_F32 || ggml_n_dims(tensor_meta) != 1) {
                fprintf(stderr, "%s: direction tensor invalid in %s\n", __func__, load_info.fname.c_str());
                ggml_free(meta_ctx);
                gguf_free(meta_ctx_gguf);
                return result;
            }
            if (result.n_embd == -1) {
                result.n_embd = ggml_nelements(tensor_meta);
            } else if (ggml_nelements(tensor_meta) != result.n_embd) {
                fprintf(stderr, "%s: direction tensor sizes mismatched in %s\n", __func__, load_info.fname.c_str());
                ggml_free(meta_ctx);
                gguf_free(meta_ctx_gguf);
                return result;
            }
            n_bytes += ggml_nbytes(tensor_meta);
        }
        ggml_free(meta_ctx);
        gguf_free(meta_ctx_gguf);
    }

    if (n_tensors == 0) {
        fprintf(stderr, "%s: no direction tensors found in %s\n", __func__, load_info.fname.c_str());
        return result;
    }

    // load and scale tensors into final control vector context
    struct ggml_init_params ggml_params = {
        /* .mem_size   = */ ggml_tensor_overhead() * n_tensors + n_bytes,
        /* .mem_buffer = */ nullptr,
        /* .no_alloc   = */ false,
    };
    struct ggml_context * ctx = ggml_init(ggml_params);

    struct gguf_init_params params = {
        /*.no_alloc = */ false,
        /*.ctx      = */ &ctx,
    };
    struct gguf_context * ctx_gguf = gguf_init_from_file(load_info.fname.c_str(), params);
    if (!ctx_gguf) {
        fprintf(stderr, "%s: failed to load control vector from %s\n", __func__, load_info.fname.c_str());
        ggml_free(ctx);
        return result;
    }

    // do not store data for layer 0 (it's not used)
    result.data.resize(result.n_embd * max_direction_layer);

    for (uint32_t il = 1; il <= max_direction_layer; il++) {
        const std::string name = "direction." + std::to_string(il);
        const ggml_tensor * tensor = ggml_get_tensor(ctx, name.c_str());

        float * dst = result.data.data() + result.n_embd * (il - 1);

        if (tensor) {
            const float * src = (const float *) tensor->data;
            for (int j = 0; j < result.n_embd; j++) {
                dst[j] = src[j] * load_info.strength;
            }
        } else {
            for (int j = 0; j < result.n_embd; j++) {
                dst[j] = 0.0f;
            }
        }
    }

    ////ggml_free(ctx);
    ////gguf_free(ctx_gguf);
    return result;
}

Is the ctx and ctx_gguf not just local to this function and therefore should be freed before the last return (see my ////) or is there some global side effect? The "load and scale tensors into final control vector context" comment and the fact they are not freed makes it look like there is, but I don't see how it could be used again as the scaling dst[j] = src[j] * load_info.strength and summing result.data[i] += cur.data[i] both happen to the llama_control_vector_data.data member and I can't see anywhere where the float * from the .data member is assigned/aliased?

So assuming the above is correct and ctx and ctx_gguf are local to llama_control_vector_load_one(), then it looks like it's just a 1-line fix to extend the vector as needed:

llama_control_vector_data llama_control_vector_load(const std::vector<llama_control_vector_load_info> & load_infos) {
    llama_control_vector_data result = { -1, {} };

    for (const auto & info : load_infos) {
        auto cur = llama_control_vector_load_one(info);

        if (cur.n_embd == -1) {
            return result;
        }
        if (result.n_embd != -1 && result.n_embd != cur.n_embd) {
            fprintf(stderr, "%s: control vector in %s does not match previous vector dimensions\n", __func__, info.fname.c_str());
            return result;
        }

        if (result.n_embd == -1) {
            result = std::move(cur);
        } else {
            ////result.data.resize(std::max(result.data.size(), cur.data.size()), 0.0f);  // extend if necessary
            for (size_t i = 0; i < cur.data.size(); i++) {
                result.data[i] += cur.data[i];
            }
        }
    }

    if (result.n_embd == -1) {
        fprintf(stderr, "%s: no vectors passed\n", __func__);
    }

    return result;
}

Also (assuming the above is correct), is llama.cpp generally against hard exits on failure of things like loading the first control vector then finding the second is a mismatch? As a user, I found it quite confusing and didn't see to start with that only 1 of my 2 control vectors were getting loaded because of the return result here:

        if (cur.n_embd == -1) {
            return result;
        }
        if (result.n_embd != -1 && result.n_embd != cur.n_embd) {
            fprintf(stderr, "%s: control vector in %s does not match previous vector dimensions\n", __func__, info.fname.c_str());
            return result;
        }

If hard exits are discouraged / not allowed, then perhaps a continue would be better to try loading the rest of the control vectors?

@slaren
Copy link
Collaborator

slaren commented Jun 26, 2024

This function has several leaks, and at least some of them were fixed in #6289, but that was never merged. It would be great to merge at least the fixes from that PR.

The core llama.cpp library should not crash on bad inputs, but this code is only part of the examples and not the core library, and it doesn't need to conform to the same requirements. However, in practice these functions are necessary to use control vectors, and in my opinion eventually should be moved to llama.cpp.

@jukofyork
Copy link
Contributor Author

This function has several leaks, and at least some of them were fixed in #6289, but that was never merged. It would be great to merge at least the fixes from that PR.

The core llama.cpp library should not crash on bad inputs, but this code is only part of the examples and not the core library, and it doesn't need to conform to the same requirements. However, in practice these functions are necessary to use control vectors, and in my opinion eventually should be moved to llama.cpp.

Thanks, I'll have a look through the other PR and see if I can incorporate any of the leaks they fixed.

I'll start a new PR for this now.

@jukofyork
Copy link
Contributor Author

See: #8137

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jun 30, 2024
…ions in `llama.cpp` [needs testing] (ggerganov#8060)

* fixes ggerganov#7999

The `build_command_r` forgot to add the control vector.

* Fixes qwen2 too

* Fixed all models' control vectors

* Removed double calls to `cb(cur, "l_out", il)`

* Moved control vector logic to llama_control_vector:apply_to()
MagnusS0 pushed a commit to MagnusS0/llama.cpp-normistral-tokenizer that referenced this pull request Jul 1, 2024
…ions in `llama.cpp` [needs testing] (ggerganov#8060)

* fixes ggerganov#7999

The `build_command_r` forgot to add the control vector.

* Fixes qwen2 too

* Fixed all models' control vectors

* Removed double calls to `cb(cur, "l_out", il)`

* Moved control vector logic to llama_control_vector:apply_to()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: UwU Emergency! Control Vectors for Qwen2 and Command-r Models Need Fixing!
4 participants