fixes #7999 (adds control vectors to all `build_XXX()` functions in `llama.cpp`) [needs testing]
#8060
Conversation
jukofyork commented Jun 22, 2024 (edited by mofosyne)
- I have read the contributing guidelines
- Self-reported review complexity:
- Low
- Medium
- High
@jukofyork what are your thoughts about this other PR #8052 that she made? Is it related to this fix?
The `build_command_r()` function forgot to add the control vector, so this PR adds it to `build_command_r()` and `build_qwen2()`.
I can't actually test the ...
I'm using my own code to create the control vectors, so I'm not really sure how the new ... But the fix looks like a good idea, as adding a space there can really screw up some models due to the "token bias problem": it forces the model to create the first word using rare/out-of-distribution tokens instead of the tokens it has that start with a space. This was reported yesterday on Reddit as causing problems for the new ...

EDIT: My reply assumes their PR is talking about this space: "[INST] Act as if you're extremely happy. [/INST]<HERE>" - I've not read the rest of the code to know what ...
Actually, it looks like most of the `build_XXX()` functions are missing the control vector code. I can add them all in, but I don't have the models to test against, so I'm not sure if this is a good idea? The only ones I can see that already have the code in are:
I just added these 2 in this PR:
but there are dozens of other `build_XXX()` functions that don't.

Plus there's gonna be loads of duplicated code... It would be better IMO to implement a helper function like the existing ...

Can you guys have a look at this? I'm happy to fix them all, but I'm not sure what is the best way to test any of it without the models...
I also see:
which may be due to this CPU being little-endian? (I've no idea about Macs...) If so, and with the fact that so many other ...
Not related to the above, but still relevant to the control vector code:

llama_control_vector_data llama_control_vector_load(const std::vector<llama_control_vector_load_info> & load_infos) {
llama_control_vector_data result = { -1, {} };
for (const auto & info : load_infos) {
auto cur = llama_control_vector_load_one(info);
if (cur.n_embd == -1) {
return result;
}
if (result.n_embd != -1 && (result.n_embd != cur.n_embd || result.data.size() != cur.data.size())) {
fprintf(stderr, "%s: control vector in %s does not match previous vector dimensions\n", __func__, info.fname.c_str());
return result;
}
if (result.n_embd == -1) {
result = std::move(cur);
} else {
for (size_t i = 0; i < cur.data.size(); i++) {
result.data[i] += cur.data[i];
}
}
}
if (result.n_embd == -1) {
fprintf(stderr, "%s: no vectors passed\n", __func__);
}
return result;
}

This loop really should take the union of control vectors and only sum those that have matching indices. As it is, it just bails out if two control vector files don't have exactly the same dimensions and layer count. I can't see an easy way to adapt ...
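A minimal sketch of the union idea, operating on the flattened `llama_control_vector_data` buffers used above (n_embd floats per layer, starting from layer 1); the helper name here is hypothetical:

// Hypothetical helper: sum two flattened control-vector buffers that may cover
// different numbers of layers, by taking the union of layers (missing layers
// are treated as zero) instead of erroring out on a size mismatch.
static void control_vector_add_union(llama_control_vector_data & dst,
                                     const llama_control_vector_data & src) {
    GGML_ASSERT(dst.n_embd == src.n_embd); // per-layer dimension must still agree
    if (src.data.size() > dst.data.size()) {
        dst.data.resize(src.data.size(), 0.0f); // extend dst with zeroed layers
    }
    for (size_t i = 0; i < src.data.size(); i++) { // sum the overlapping region
        dst.data[i] += src.data[i];
    }
}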
A few more things related to control vectors worth mentioning. In the Python code (https://github.com/vgel/repeng/blob/main/repeng/extract.py):

@classmethod
def import_gguf(cls, path: os.PathLike[str] | str) -> "ControlVector":
    reader = gguf.GGUFReader(path)

    archf = reader.get_field("general.architecture")
    if not archf or not len(archf.parts):
        warnings.warn(".gguf file missing architecture field")
    else:
        arch = str(bytes(archf.parts[-1]), encoding="utf-8", errors="replace")
        if arch != "controlvector":
            warnings.warn(
                f".gguf file with architecture {arch!r} does not appear to be a control vector!"
            )

    modelf = reader.get_field("controlvector.model_hint")
    if not modelf or not len(modelf.parts):
        raise ValueError(".gguf file missing controlvector.model_hint field")
    model_hint = str(bytes(modelf.parts[-1]), encoding="utf-8")

    directions = {}
    for tensor in reader.tensors:
        if not tensor.name.startswith("direction."):
            continue
        try:
            layer = int(tensor.name.split(".")[1])
        except:
            raise ValueError(
                f".gguf file has invalid direction field name: {tensor.name}"
            )
        directions[layer] = tensor.data
    return cls(model_type=model_hint, directions=directions)

He sets ...

I'm also not sure if the Python code has an off-by-1 error and the layers are all 1 lower than they should be:

# normalize the layer indexes if they're negative
n_layers = len(model_layer_list(model))
hidden_layers = [i if i >= 0 else n_layers + i for i in hidden_layers]

This makes me think he's setting the floor to zero when the llama.cpp side expects the direction indices to start from layer 1:

// Apply a loaded control vector to a llama_context, or if data is NULL, clear
// the currently loaded vector.
// n_embd should be the size of a single layer's control, and data should point
// to an n_embd x n_layers buffer starting from layer 1.
// il_start and il_end are the layer range the vector should apply to (both inclusive)
// See llama_control_vector_load in common to load a control vector.
LLAMA_API int32_t llama_control_vector_apply(
struct llama_context * lctx,
const float * data,
size_t len,
int32_t n_embd,
int32_t il_start,
int32_t il_end);

I'm also fairly sure this:

if positive_smaller_mean > positive_larger_mean:  # type: ignore
    directions[layer] *= -1

should actually just be:

directions[layer] *= positive_larger_mean - positive_smaller_mean

(or something similar - his code is very hard to understand). Basically you want to project the datasets onto the direction and scale by the negative mean difference (or equivalently, project the differenced dataset and use its mean), as the eigenvectors returned by this:

pca_model = PCA(n_components=1, whiten=False).fit(train)
# shape (n_features,)
directions[layer] = pca_model.components_.astype(np.float32).squeeze(axis=0)

all have unit norm (I assume ...), and from my own code you can see the dramatically different scales of the projected means (ie: ...).
The means are 2 orders of magnitude different between the first and last layer! This is due to the way the residual stream overwrites the original embeddings as the layers progress (ie: by writing larger and larger values that eventually dwarf the embeddings in the residual addition operations, etc). You don't actually often find useful control vectors in the very early layers, but even 1/3 of the way through they are 1 order of magnitude lower:
and this is going to mean that with a single scale-factor it will be a case of reducing it hugely so as not to screw up the model in the later layers, whilst simultaneously not scaling the early-mid layers enough... I'm not sure if the new ...
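To spell out the scaling idea from a few paragraphs above (my notation, not the repeng variable names): with $v_\ell$ the unit-norm PCA direction at layer $\ell$, and $\mu_\ell^{+}, \mu_\ell^{-}$ the means of the positive/negative datasets projected onto $v_\ell$, the suggestion is to store

$$d_\ell = \left(\mu_\ell^{+} - \mu_\ell^{-}\right) v_\ell$$

rather than just $\pm v_\ell$, so that each layer's direction already carries a magnitude matched to that layer's residual-stream scale instead of relying on one global scale-factor.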
Just add them all in, we will hear later if someone has problems with them.

I'll have a look later - it would be nice to know what the arm64 error is though? Why would that fail when presumably the exact same code snippet worked in the other models?

That's an odd error to get too.

It could probably be done in the ...

These are known intermittent CI failures and are not related to this PR.
I've just done this for every one:

if (ggml_tensor * layer_dir = lctx.cvec.tensor_for(il)) {
cur = ggml_add(ctx0, cur, layer_dir);
}
cb(cur, "l_out", il);
// input for next layer
inpL = cur;

as it was easy to search for ... It's just compiling now.
I also saw there were 2-3 places where `cb(cur, "l_out", il)` was called twice. What does `cb()` actually do?
It stands for "callback", but it is a different callback than the one imatrix and other tools use. It was used for more things in the past; nowadays it is only used to set the names of the tensors and adjust the backends used in some ops.

Lines 11772 to 11777 in 3e58b0e
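For context, the callback passed into the `build_XXX()` functions is roughly of this shape (a simplified sketch from memory of the code around that permalink, not an exact quote):

// Sketch: the graph-build callback mostly just names tensors so they can be
// found later (e.g. "l_out-31"); the backend-selection logic is elided here.
llm_build_cb cb = [&](struct ggml_tensor * cur, const char * name, int il) {
    if (il >= 0) {
        ggml_format_name(cur, "%s-%d", name, il);
    } else {
        ggml_set_name(cur, name);
    }
    // ... per-op backend/offload adjustments elided ...
};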
Oh, I thought it might have been related to this:
Has there been any thought about adding a flag and then casting in/out of Big-Endian and allowing these to work? It's been a long time since I dealt with anything other than x86, but back in the days of the M68000 it was quite common to do this to allow sharing data between Z80, 8080, etc.
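(Purely to illustrate what such a flag would have to do - a self-contained byte-swap sketch, not anything that currently exists in the codebase:)

#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative only: flip a buffer of 32-bit floats between big- and
// little-endian in place by reversing the byte order of each element.
static void byteswap_f32_buffer(float * data, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint32_t v;
        std::memcpy(&v, &data[i], sizeof(v));
        v = (v >> 24) | ((v >> 8) & 0x0000FF00u) |
            ((v << 8) & 0x00FF0000u) | (v << 24);
        std::memcpy(&data[i], &v, sizeof(v));
    }
}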
It's compiled and I can confirm that ... I've triple-checked the diff and I can't see anything I've screwed up, but it was a lot of changes to make...
At some point someone added support for big endian that consisted of bumping the gguf version (?). AFAIK there are no significant big-endian machines these days, and in my opinion it is not worth spending any effort on that.
@slaren Can you check the double calls to `cb(cur, "l_out", il)` mentioned above?
Yeah, I've never come across a Big-Endian CPU since the M68000 days, but thought it was possible some of the Mac's CPUs might have been...
I removed the double calls: f393795
Changed the title from "fixes #7999 (adds control vectors to `build_command_r()` and `build_qwen2()`)" to "fixes #7999 (adds control vectors to all `build_XXX()` functions in `llama.cpp`) [needs testing]".
I might have a crack at refactoring this properly this week. So instead of just copying this over and over:

if (layer_dir != nullptr) {
cur = ggml_add(ctx0, cur, layer_dir);
}

I could unify the control vectors, (runtime) orthogonal projection onto a subspace (what people are calling "ablation" or "abliteration"), and the Householder Transform (ie: reflection over a hyperplane to "invert" the models' concepts: it's just ...). It will only be O(n) if we use ...

@slaren @ggerganov What are the correct ...
I will then wrap this up in a similar function to ...

I will also look into refactoring the code that combines (ie: sums) the control vectors so it doesn't need padding with zero vectors any more.

Any comments or suggestions on doing this?
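Spelling out the three operations being unified (my notation, with hidden state $h$, additive control vector $c$, and a unit direction $v$):

$$
\begin{aligned}
\text{control vector:} \quad & h \leftarrow h + c \\
\text{ablation (projection):} \quad & h \leftarrow h - (v^{\top} h)\, v \\
\text{Householder reflection:} \quad & h \leftarrow h - 2\,(v^{\top} h)\, v
\end{aligned}
$$

All three stay $O(n)$ per token and layer as long as the scalar $v^{\top} h$ is computed first, rather than materialising the $n \times n$ matrix $I - v v^{\top}$ (or $I - 2 v v^{\top}$) and multiplying by it.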
There isn't a dot product function by itself, but you can use ...
There is ...
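Presumably the truncated suggestion above is `ggml_mul_mat`, which for two 1-D tensors of the same length produces a single element, i.e. the dot product. A minimal standalone sketch under that assumption (CPU path, ggml API roughly as of this PR - treat the exact calls as approximate):

#include "ggml.h"
#include <string.h>

// Sketch: dot product of two float vectors via ggml_mul_mat.
static float ggml_dot_example(const float * a, const float * b, int n) {
    struct ggml_init_params ip = {
        /*.mem_size   =*/ 16*1024*1024,   // plenty for two small vectors + graph
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(ip);

    struct ggml_tensor * ta = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n);
    struct ggml_tensor * tb = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n);
    memcpy(ta->data, a, n*sizeof(float));
    memcpy(tb->data, b, n*sizeof(float));

    struct ggml_tensor * dot = ggml_mul_mat(ctx, ta, tb);  // result shape [1]

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, dot);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/ 1);

    const float out = ggml_get_f32_1d(dot, 0);
    ggml_free(ctx);
    return out;
}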
Thanks - I'll have a look and try to familiarize myself with the functions first. IIRC, there is some gotcha with the ...

Here is one explanation: https://github.com/ggerganov/llama.cpp/blob/master/README.md#coding-guidelines

Yeah, that's it, thanks!
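For reference, the gotcha documented in those guidelines (quoting from memory, so double-check the README) is the unconventional matrix-multiplication convention:

$$C = \mathrm{ggml\_mul\_mat}(\mathrm{ctx}, A, B) \;\;\Rightarrow\;\; C^{\top} = A B^{\top} \;\Leftrightarrow\; C = B A^{\top}$$

i.e. the result comes out "transposed" relative to the usual textbook convention.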
So just throwing out some ideas now and I would be happy for any input or feedback:

The least bad / least confusing name for this I can think of is "control-projector", and following the existing `--control-vector` options,
I propose these new command line options:
The ...

The default value for ...

The separate ...
So for (2) we could:
Any feedback, input or other ideas would be very much appreciated! :)
Maybe this is the least confusing:
Having ...
I've made a start by moving the logic into `struct llama_control_vector`:

struct llama_control_vector {
std::vector<struct ggml_tensor *> tensors; // per layer
std::vector<struct ggml_context *> ctxs;
std::vector<ggml_backend_buffer_t> bufs;
int32_t layer_start = -1;
int32_t layer_end = -1;
struct ggml_tensor * tensor_for(int il) const {
if (il < 0 || il < layer_start || il > layer_end || (size_t) il >= tensors.size()) {
return nullptr;
}
return tensors[il];
}
struct ggml_tensor * apply_to(struct ggml_context * ctx, struct ggml_tensor * cur, int il) const {
ggml_tensor * layer_dir = tensor_for(il);
if (layer_dir != nullptr) {
cur = ggml_add(ctx, cur, layer_dir);
}
return cur;
}
~llama_control_vector() {
for (struct ggml_context * ctx : ctxs) {
ggml_free(ctx);
}
for (ggml_backend_buffer_t buf : bufs) {
ggml_backend_buffer_free(buf);
}
}
};

Which then means each `build_XXX()` function now just does:

cur = lctx.cvec.apply_to(ctx0, cur, il);
cb(cur, "l_out", il); There was already a if (!params.control_vectors.empty()) {
if (params.control_vector_layer_start <= 0) params.control_vector_layer_start = 1;
if (params.control_vector_layer_end <= 0) params.control_vector_layer_end = llama_n_layer(model);
const auto cvec = llama_control_vector_load(params.control_vectors);
if (cvec.n_embd == -1) {
llama_free(lctx);
llama_free_model(model);
return std::make_tuple(nullptr, nullptr);
}
int err = llama_control_vector_apply(lctx,
cvec.data.data(),
cvec.data.size(),
cvec.n_embd,
params.control_vector_layer_start,
params.control_vector_layer_end);
if (err) {
llama_free(lctx);
llama_free_model(model);
return std::make_tuple(nullptr, nullptr);
}
}

So I thought it was best to just add another member to ...
I think the simplest way forward to add the "control projections" is to duplicate the code for everything related to the control vectors.

Should I make a new PR for this or keep working in this PR?
Since this fixes an immediate issue, I think it is better to merge this and continue development of new features in a different PR.
No problem - I'll make a new PR tomorrow to allow taking the union of control vectors, as that too fixes an existing problem, and then look at adding the new stuff in another PR.
@slaren Sorry to bother you again:

static llama_control_vector_data llama_control_vector_load_one(const llama_control_vector_load_info & load_info) {
int32_t n_tensors;
size_t n_bytes = 0;
uint32_t max_direction_layer = 0;
llama_control_vector_data result = { -1, {} };
// calculate size of ctx needed for tensors, ensure tensors are f32, and find max layer
{
struct ggml_init_params meta_params = {
/* .mem_size = */ ggml_tensor_overhead() * 128 + ggml_graph_overhead(),
/* .mem_buffer = */ nullptr,
/* .no_alloc = */ true,
};
ggml_context * meta_ctx = ggml_init(meta_params);
struct gguf_init_params meta_gguf_params = {
/* .no_alloc = */ true,
/* .ctx = */ &meta_ctx,
};
struct gguf_context * meta_ctx_gguf = gguf_init_from_file(load_info.fname.c_str(), meta_gguf_params);
if (!meta_ctx_gguf) {
fprintf(stderr, "%s: failed to load control vector from %s\n", __func__, load_info.fname.c_str());
ggml_free(meta_ctx);
return result;
}
n_tensors = gguf_get_n_tensors(meta_ctx_gguf);
for (int i = 0; i < n_tensors; i++) {
std::string name = gguf_get_tensor_name(meta_ctx_gguf, i);
// split on '.'
size_t dotpos = name.find('.');
if (dotpos != std::string::npos && name.substr(0, dotpos) == "direction") {
try {
uint32_t layer = std::stoi(name.substr(dotpos + 1));
if (layer == 0) {
fprintf(stderr, "%s: direction tensor invalid in %s\n", __func__, load_info.fname.c_str());
ggml_free(meta_ctx);
gguf_free(meta_ctx_gguf);
return result;
}
if (layer > max_direction_layer) {
max_direction_layer = layer;
}
} catch (...) {
fprintf(stderr, "%s: direction tensor invalid in %s\n", __func__, load_info.fname.c_str());
ggml_free(meta_ctx);
gguf_free(meta_ctx_gguf);
return result;
}
}
struct ggml_tensor * tensor_meta = ggml_get_tensor(meta_ctx, name.c_str());
if (tensor_meta->type != GGML_TYPE_F32 || ggml_n_dims(tensor_meta) != 1) {
fprintf(stderr, "%s: direction tensor invalid in %s\n", __func__, load_info.fname.c_str());
ggml_free(meta_ctx);
gguf_free(meta_ctx_gguf);
return result;
}
if (result.n_embd == -1) {
result.n_embd = ggml_nelements(tensor_meta);
} else if (ggml_nelements(tensor_meta) != result.n_embd) {
fprintf(stderr, "%s: direction tensor sizes mismatched in %s\n", __func__, load_info.fname.c_str());
ggml_free(meta_ctx);
gguf_free(meta_ctx_gguf);
return result;
}
n_bytes += ggml_nbytes(tensor_meta);
}
ggml_free(meta_ctx);
gguf_free(meta_ctx_gguf);
}
if (n_tensors == 0) {
fprintf(stderr, "%s: no direction tensors found in %s\n", __func__, load_info.fname.c_str());
return result;
}
// load and scale tensors into final control vector context
struct ggml_init_params ggml_params = {
/* .mem_size = */ ggml_tensor_overhead() * n_tensors + n_bytes,
/* .mem_buffer = */ nullptr,
/* .no_alloc = */ false,
};
struct ggml_context * ctx = ggml_init(ggml_params);
struct gguf_init_params params = {
/*.no_alloc = */ false,
/*.ctx = */ &ctx,
};
struct gguf_context * ctx_gguf = gguf_init_from_file(load_info.fname.c_str(), params);
if (!ctx_gguf) {
fprintf(stderr, "%s: failed to load control vector from %s\n", __func__, load_info.fname.c_str());
ggml_free(ctx);
return result;
}
// do not store data for layer 0 (it's not used)
result.data.resize(result.n_embd * max_direction_layer);
for (uint32_t il = 1; il <= max_direction_layer; il++) {
const std::string name = "direction." + std::to_string(il);
const ggml_tensor * tensor = ggml_get_tensor(ctx, name.c_str());
float * dst = result.data.data() + result.n_embd * (il - 1);
if (tensor) {
const float * src = (const float *) tensor->data;
for (int j = 0; j < result.n_embd; j++) {
dst[j] = src[j] * load_info.strength;
}
} else {
for (int j = 0; j < result.n_embd; j++) {
dst[j] = 0.0f;
}
}
}
////ggml_free(ctx);
////gguf_free(ctx_gguf);
return result;
}

Is the fact that `ctx` and `ctx_gguf` never get freed at the end (see the `////` lines) a leak, or ...?

So assuming the above is correct and ...:

llama_control_vector_data llama_control_vector_load(const std::vector<llama_control_vector_load_info> & load_infos) {
llama_control_vector_data result = { -1, {} };
for (const auto & info : load_infos) {
auto cur = llama_control_vector_load_one(info);
if (cur.n_embd == -1) {
return result;
}
if (result.n_embd != -1 && result.n_embd != cur.n_embd) {
fprintf(stderr, "%s: control vector in %s does not match previous vector dimensions\n", __func__, info.fname.c_str());
return result;
}
if (result.n_embd == -1) {
result = std::move(cur);
} else {
////result.data.resize(std::max(result.data.size(), cur.data.size()), 0.0f); // extend if necessary
for (size_t i = 0; i < cur.data.size(); i++) {
result.data[i] += cur.data[i];
}
}
}
if (result.n_embd == -1) {
fprintf(stderr, "%s: no vectors passed\n", __func__);
}
return result;
}

Also (assuming the above is correct), is it OK to just bail out like this:

if (cur.n_embd == -1) {
return result;
}
if (result.n_embd != -1 && result.n_embd != cur.n_embd) {
fprintf(stderr, "%s: control vector in %s does not match previous vector dimensions\n", __func__, info.fname.c_str());
return result;
}

If hard exits are discouraged / not allowed, then perhaps a ...
This function has several leaks, and at least some of them were fixed in #6289, but that was never merged. It would be great to merge at least the fixes from that PR. The core llama.cpp library should not crash on bad inputs, but this code is only part of the examples and not the core library, and it doesn't need to conform to the same requirements. However, in practice these functions are necessary to use control vectors, and in my opinion they should eventually be moved to llama.cpp.

Thanks, I'll have a look through the other PR and see if I can incorporate any of the leak fixes. I'll start a new PR for this now.

See: #8137
…ions in `llama.cpp` [needs testing] (ggerganov#8060)

* fixes ggerganov#7999: The `build_command_r` function forgot to add the control vector.
* Fixes qwen2 too
* Fixed all models' control vectors
* Removed double calls to `cb(cur, "l_out", il)`
* Moved control vector logic to `llama_control_vector::apply_to()`