
Add cvector-generator example #7514

Merged
merged 58 commits into ggerganov:master, Jun 15, 2024

Conversation

@ngxson (Collaborator) commented May 24, 2024

Resolve #6880

Result from last working version: #7514 (comment)

  • Get hidden layer embeddings
  • Calculate diff between positive and negative prompts
  • Implement PCA
  • Export output to gguf file
  • Support for multiple pairs of positive/negative prompts
  • Add README

TODO in next PRs:

@ngxson added the "help wanted" label (extra attention is needed) May 24, 2024
@mofosyne added the "Review Complexity : High" label (generally requires in-depth knowledge of LLMs or GPUs) May 24, 2024
@christianazinn (Contributor):
Could you add a quick usage summary - do you just run ./control-vector-generator -m model ... like usual inferencing?

Also, I tried implementing PCA using the ggml library here. Maybe I'm using the wrong methods, but ggml_norm and ggml_norm_inplace always just return a zero vector, and there aren't any docs I can find to set me right - I just want to normalize those vectors to length 1 so repeated calls to ggml_mul_mat don't blow up in precision. Feel free to mess with the linked snippet.
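For clarity, here is a rough stdlib-only sketch of the normalization I mean (plain C++, not the ggml API; if I understand correctly, ggml_norm is a layer-norm-style op rather than an L2 normalization, which might explain the zero vectors):

```cpp
#include <cmath>
#include <vector>

// Scale a vector to unit length (L2 norm == 1) so that repeated
// matrix-vector products during power iteration stay in a sane range.
static void normalize_inplace(std::vector<float> & v) {
    double sum = 0.0;
    for (float x : v) sum += (double) x * x;
    const double norm = std::sqrt(sum);
    if (norm == 0.0) return; // nothing to do for a zero vector
    for (float & x : v) x = (float) (x / norm);
}
```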

@ngxson (Collaborator, Author) commented May 24, 2024

Hi @christianazinn, and thanks for your response. We'll move the discussion here.

Quick explanation: my code can take a pair of positive + negative prompts, calculate embeddings for each layer, and then subtract them to get the diff. In the end, for each layer, we have one matrix with shape [n_embd, n_tokens]. What we can do now is reduce it to a single vector [n_embd] using PCA.
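To illustrate the diff step (a rough sketch with hypothetical names, not the exact code in this PR):

```cpp
#include <vector>

// pos and neg hold one layer's hidden states for the positive and negative
// prompts, flattened to n_embd * n_tokens floats. Their element-wise
// difference is the matrix that PCA later reduces to a single direction
// of size n_embd.
static std::vector<float> layer_diff(const std::vector<float> & pos,
                                     const std::vector<float> & neg) {
    std::vector<float> diff(pos.size());
    for (size_t i = 0; i < pos.size(); i++) {
        diff[i] = pos[i] - neg[i];
    }
    return diff;
}
```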

The way to use it: ./control-vector-generator -m model.gguf. For now, you need to modify the prompts inside the code (by default, "happy" vs "sad").

It is not urgent so take your time. And feel free to let me know if you have other questions. Thank you.

@christianazinn (Contributor):
Looking into the PCA implementation, I realize we have the problem that we're not actually getting square matrices from get-hidden-layers (and one cannot retrieve eigenvectors directly from non-square matrices), but this is easily bypassed by multiplying the matrix by its transpose and doing power iteration on that.

However, it appears the matrices we receive are usually tall and skinny. SciPy's original implementation indicates that in this case, the problem is best handled by SVD with the covariance matrix. We may care to implement this after everything else works.
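For concreteness, a minimal standalone sketch of the power-iteration idea on the square matrix M = A*Aᵀ (hypothetical names, plain C++, not the code in this branch):

```cpp
#include <cmath>
#include <vector>

// Power iteration: repeatedly multiply a vector by the square n x n matrix M
// (row-major, flattened) and re-normalize; the vector converges towards the
// dominant eigenvector, i.e. the first principal component.
static std::vector<float> power_iteration(const std::vector<float> & M, int n, int iters = 100) {
    std::vector<float> v(n, 1.0f), next(n);
    for (int it = 0; it < iters; it++) {
        // next = M * v
        for (int i = 0; i < n; i++) {
            double acc = 0.0;
            for (int j = 0; j < n; j++) acc += (double) M[i * n + j] * v[j];
            next[i] = (float) acc;
        }
        // re-normalize so repeated products do not overflow
        double norm = 0.0;
        for (float x : next) norm += (double) x * x;
        norm = std::sqrt(norm);
        if (norm == 0.0) break;
        for (int i = 0; i < n; i++) v[i] = (float) (next[i] / norm);
    }
    return v;
}
```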

I also don't have push permissions to this branch so whatever changes I make, I'll fork the branch and PR into it.

@ngxson (Collaborator, Author) commented May 29, 2024

@christianazinn Thanks for the explanation. Yes, I was also wondering how we could turn the embedding vectors into a square matrix. It's all clear to me now.

I'll have a look during the weekend. In the meantime, I've invited you to my forked repo. You can push directly onto this branch, or you can work on your own PR if you want. Feel free to tag me if you have questions. Thank you!

Implements PCA and file writing using mostly standard libraries. The output is recognized as a functional control vector, but outputs gibberish.
@christianazinn (Contributor):
Thank you, have pushed an implementation with primitives/stdlib. Currently assumes Mistral architecture for the model_hint, but it successfully creates a control vector that is recognized for inference by llama.cpp. It is, of course, very slow, and many things still need to be implemented. I've marked what needs to be "translated" for reference, and left a few TODOs around.

Currently, however, it outputs gibberish when inferencing: e.g. [/AVAILABLE_TOOLS][control_26][control_10][control_31][control_17][control_36][/INST][/TOOL_RESULTS][control_20][TOOL_RESULTS][control_32][control_33][control_23][control_31][control_31][control_20][control_34][/TOOL_RESULTS][control_27][control_8][control_18][INST][control_30]</s> [end of text]. I am not sure why this happens. How are we retrieving positive/negative prompts - do we use the same completion format as the Python implementation, or something else?

Added basic command-line parameters for outfile and one each positive/negative prompt.

Refactored some messy code in PCA computation and GGUF exporting.

Left a bunch of comments regarding further work needed.
@christianazinn (Contributor):
Notes follow.

I have implemented basic command-line arguments for --outfile, --positive, and --negative. Currently we only support one each of positive/negative prompts.

I've left a few comments about what needs to be fixed in my shoddy implementation, and other things we need to deal with, such as the prompt parsing mentioned above. It appears we do just parse the individual positive/negative prompts - @ngxson, can you confirm? We will likely want to change this to provide a larger sample space; the blog post and Python implementation provide a reference for how to implement it.

However, I am seeing promising results with "funny" vs. "boring". Llama2 Q8_0, prompt (for completion) "Here's a funny joke: ". Llama2 was used because #5970 indicates support has not been implemented for architectures other than Llama, but that is probably outdated.

Control vector -1: What do you call a group of paintballs in space? Gravity does not affect them! (and a lot of other very unfunny jokes.)
Control vector 1: A man walked into a library and asked the librarian, "Do you have any books on the history of Madness?" The librarian replied, "It's not a very good idea to write a book on the history of Madness. You will just get a lot of people asking for their money back." (the others were not great, but better than the -1 group.)
No control vector: Why don't scientists trust atoms? Because they make up everything! (and many other common jokes.)

@ngxson (Collaborator, Author) commented May 30, 2024

@christianazinn Wow, this is awesome. I quickly had a look at the code; looks good to me. I'll try it when I get back home.

It appears we do just parse the individual positive/negative prompts

I started with a single pos-neg pair for simplicity. But yes, eventually we will allow multiple pairs of pos-neg prompts. The Python implementation does that by calculating the mean of the output direction vectors. We could do the same, but I'm wondering if we can go a bit further by using LERP.

We can allow the program to take two files of prompts as input (one prompt per line), for example neg.txt and pos.txt. I can implement this quickly if needed.
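Something along these lines (a rough sketch; the file names and helper are just examples):

```cpp
#include <fstream>
#include <string>
#include <vector>

// Read one prompt per line, skipping empty lines. The same helper would be
// used for both pos.txt and neg.txt; the two lists must end up equal in size.
static std::vector<std::string> read_prompts(const std::string & path) {
    std::vector<std::string> prompts;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty()) {
            prompts.push_back(line);
        }
    }
    return prompts;
}
```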

However, I am seeing promising results with "funny" vs. "boring"

Very promising results. Even I (a human) sometimes struggle to control my own funny/boring vector.

@christianazinn (Contributor) commented May 30, 2024

@christianazinn Wow, this is awesome. I quickly had a look at the code; looks good to me. I'll try it when I get back home.

Thank you! Take your time, I will keep testing in the meantime. Other results are varied: a test on happy/sad generates complete gibberish, and another control vector for funny/boring is ineffective.

I started with a single pos-neg pair for simplicity. But yes, eventually we will allow multiple pairs of pos-neg prompts.

Just to make sure we are on the same page, because there are two places where multiple pairs might be needed. We will also want to implement multiple sentiment pairs (i.e. happy/sad and funny/boring), but what I referred to was having multiple prompts generated from the same sentiment pair run through the tokenizer, as in the second code block here. Currently we appear to just tokenize the terms, e.g. funny and boring, while the Python implementation tokenizes a template, e.g. [INST] Pretend you're a funny person making statements about the world. [/INST] The. This means that when the tokens are passed through the model, we see inference on completing the template, which is much more accurate than passing just a single-word prompt.

I think we want to be able to do that preprocessing in C++, so the user inputs the positive/negative sentiments and we create the template, format it, and pass it to get_hidden_layers. See the commit below for an example of creating the prompts. Were you thinking of this as well?
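Roughly what I have in mind (a sketch with a made-up persona template, not necessarily the exact format we should ship):

```cpp
#include <string>
#include <vector>

// Expand a single sentiment word ("funny", "boring", ...) into several full
// prompts by pairing a persona template with a list of completion starters,
// similar in spirit to the Python implementation.
static std::vector<std::string> make_prompts(const std::string & sentiment,
                                             const std::vector<std::string> & completions) {
    std::vector<std::string> prompts;
    for (const auto & c : completions) {
        prompts.push_back("[INST] Pretend you're a " + sentiment +
                          " person making statements about the world. [/INST] " + c);
    }
    return prompts;
}
```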

I believe the great variance in my results may be due to only having one sample token sequence per sentiment, and therefore high variability in the resulting vectors between runs, hence my concern over this topic. However, more runs of PCA would slow down the already slow stdlib implementation to the point of unusability, so that is left for the GGML implementation.

Implements an example template set built from the positive/negative prompts like the control vector Python implementation.
@christianazinn (Contributor) commented May 30, 2024

It appears the way the Python implementation handles concatenating the matrices from the different prompt callbacks is by stacking them, so e.g. if each callback returned a 4096x2 matrix, then using 1024 test prompts would yield a 4096x2048 matrix. Intuitively, because rank(AAᵀ) = rank(A), this allows for more degrees of freedom/less dependency on each individual callback in each layer's overall matrix, and since the result will be 4096x4096 regardless of the other dimension, this should not change much in the PCA. Will try to implement this.

(Strictly, it stacks vertically, but it doesn't matter since we multiply by the transpose anyway.)
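In code terms the stacking is just appending the flattened per-callback data (a sketch, assuming each callback's output is already flattened column by column):

```cpp
#include <vector>

// Each callback yields one layer's diff matrix as n_embd-sized columns (one
// per evaluated token), flattened column by column. Stacking across prompts
// is then just appending the data; the stacked matrix is n_embd x total_cols,
// and A*A^T stays n_embd x n_embd regardless of how many columns we add.
static void append_columns(std::vector<float> & stacked, const std::vector<float> & callback_out) {
    stacked.insert(stacked.end(), callback_out.begin(), callback_out.end());
}
```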

@ngxson (Collaborator, Author) commented May 30, 2024

I updated this PR with a few small changes (feel free to test / adapt them if you want):

  • added arguments --positive-file and --negative-file for supplying multiple prompts
  • the output vector is now the mean over all pairs of prompts (see the sketch below)
  • added multi-threading for PCA ==> mostly for a better debugging experience, feel free to remove it if needed
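For reference, the mean is just an element-wise average over the per-pair direction vectors (a sketch with a hypothetical helper, not the exact code in the PR):

```cpp
#include <vector>

// Average the per-pair direction vectors (all of size n_embd) into one
// control-vector direction for the layer.
static std::vector<float> mean_direction(const std::vector<std::vector<float>> & dirs) {
    if (dirs.empty()) {
        return {};
    }
    std::vector<float> mean(dirs[0].size(), 0.0f);
    for (const auto & d : dirs) {
        for (size_t i = 0; i < d.size(); i++) {
            mean[i] += d[i];
        }
    }
    for (float & x : mean) {
        x /= (float) dirs.size();
    }
    return mean;
}
```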

@ngxson (Collaborator, Author) commented May 30, 2024

@christianazinn I'm having a problem: power_iteration always returns a vector whose elements are all 0. I tried reverting my multi-thread hack, but that doesn't resolve the problem. I verified that v_diff is good before passing it to PCA. Maybe something deeper is going on here... could you have a look? (And sorry if I broke anything.) Thanks in advance.

@christianazinn (Contributor):
@ngxson I'll take a look, thanks - not sure how I didn't think to check that; it would explain why I was getting gibberish on 9/10 tests. My code is very patchwork at the moment, so there are likely to be a lot of these fixes. Thanks for the progress so far.

@christianazinn (Contributor) commented May 31, 2024

Strangely each matrix returned by square_diff has exactly 2^22 nonzero elements with n_embd = 4096, in which case we would expect 2^24 total elements (testing on Llama 2). Columns 2048-4096 of rows 2048-4096 are nonzero, which explains this - only the bottom right quadrant of the 4096x4096 matrix is nonzero. This corroborates what I see returned by power_iteration, a vector with 2048 consecutive zero entries followed by 2048 nonzero entries. @ngxson were you getting all entries 0 or just the visible ones?

What's printed to stdout from cb_eval implies we get a 4096x2(x1x1) matrix from Llama 2 (and other similar models like Mistral). What's likely is one of these dimensions is entirely 0, which would explain the aforementioned behavior. Will keep looking into it.

UPDATE: Am I misunderstanding these lines (I assumed they mean we get a 4096x2x1x1 matrix)?
[screenshot of the cb_eval tensor-dimension printout]
@ngxson I'm unfamiliar with this, please advise.

UPDATE 2: I had my numbers backward with zero/nonzero. Even more confused now.
...and I fixed it because I had them backward again. Not my day today.

@christianazinn (Contributor):
What's likely is one of these dimensions is entirely 0, which would explain the aforementioned behavior. Will keep looking into it.

Thinking about it further, this isn't even true. I would still like to know how the dimensions are stored (image above). Is it a flattened matrix of dimensions cb_data.n_tokens x cb_data.n_embd, or cb_data.n_embd x cb_data.n_tokens? @ngxson Apologies for the repeated mentions, but I would like an answer. It looks like the former so far (the first 4096 elements of each v_diff are zero and the second 4096 are nonzero), but that contradicts the 4096x2x1x1 dimensions from above.

Frankly, this whole headache could probably be avoided if we just wrote the GGML implementation, but I don't know how.
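For my own reference, my current understanding of ggml's memory layout (a sketch against ggml.h; please correct me if this is wrong):

```cpp
#include "ggml.h"

// My understanding: ne[0] is the innermost, contiguous dimension, so a tensor
// printed as 4096 x 2 x 1 x 1 is n_embd x n_tokens and the first 4096 floats
// in memory belong to the first token. Element access for a contiguous f32
// 2D tensor would then look like this:
static float get_f32_2d(const struct ggml_tensor * t, int i_embd, int i_token) {
    const char * base = (const char *) t->data;
    return *(const float *) (base + i_embd * t->nb[0] + i_token * t->nb[1]);
}
```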

@christianazinn (Contributor):
fixed it... one liner... ugh

printf("\n");
}

static int ctrlvec_params_parse_ex(int argc, char ** argv, ctrl_params & params) {
@ggerganov (Owner):
Let's merge ctrl_params into gpt_params so that we have a consistent handling of CLI args in all examples

@ngxson (Collaborator, Author):
I didn't notice that gpt_params had been refactored. It's way easier to work with now!

I moved ctrl_params into gpt_params. Please have a look at 679f513. Thanks!

@christianazinn (Contributor):
What I'm planning is to only evaluate distinct tokens at distinct positions. This way, we can be sure that we get non-duplicated vectors.

This should be fine - just test it. With what you mention below about

Instead of simply concatenating the prompt with completions, should we also generate the complete sentence based on the given prompt+completion?

that should work much better. I think that's actually what the Python implementation does but I'm not certain. Feel free to try it if you like, or if you think the current outputs are acceptable, we can add that in a later PR. (We should compile a list of future improvements for this.)

I'll add my review for the code itself in a moment, and will test the generated control vectors when I get the chance.

@ngxson (Collaborator, Author) commented Jun 12, 2024

(We should compile a list of future improvements for this.)

Actually, I updated the list in the description of this PR. Feel free to let me know if you have other ideas to add.

I'll add my review for the code itself in a moment, and will test the generated control vectors when I get the chance.

Nice. Thanks for taking the time to develop and review this PR!

@calvin-laurenson (Contributor):
I am very excited about control vectors and I have been routinely testing this PR. I got it to work yesterday with only a couple of issues.

  1. It won't compile on my Windows machine without #include <ctime> in pca.hpp. It compiles fine on my Linux machine, and looking at other parts of the project, it seems like most of the time there is an include and time is used without the std:: prefix.
  2. I was unable to use a multiline prompt (for ChatML models like Dolphin) because it assumes I want multiple prompts. This manifested as a pretty weird error, because it made both prompts exactly the same, which made the subtraction turn everything into zeros (maybe there should be a check for whether the positive and negative prompts are the same?).
  3. It does not work on CUDA (I get ggml_backend_cuda_graph_compute: op not supported (view) (SQRT)). Not really much of a problem because it runs pretty fast on CPU.

I fixed 1 and 2 in a PR on the fork, ngxson#6. 2 is fixed by adding a command-line flag to combine all of the prompt lines into one prompt.

@ngxson (Collaborator, Author) commented Jun 13, 2024

@calvin-laurenson Thanks for testing it out.

Regarding the ability to have multi-line prompts, I prefer to add the --escape option, since it's already part of common.cpp. This will allow multiple prompts, each of which can span multiple lines. I'll add this option at a later stage:

-e,    --escape   process escapes sequences (\\n, \\r, \\t, \\', \\\", \\\\) (default: %s)", params.escape ? "true" : "false"

The problem with #include <ctime> will be resolved along with all the conflicts with the main branch (there is a rename examples/* --> llama-*).

For the problem with the GPU, it seems like some _inplace ops are not supported by the GPU backend. I'll try replacing them with the non-_inplace versions.

Edit: CUDA backend does not support GGML_OP_SQRT

@ggerganov (Owner) left a comment:

I haven't done tests, but I'm sure people will play with this, and if there are any issues we can resolve them from master.

Comment on lines 1984 to 1992
options.push_back({ "control-vector" });
options.push_back({ "cvector", "-o, --output FNAME", "output file (default: '%s')", params.cvector_outfile.c_str() });
options.push_back({ "cvector", "--positive-file FNAME", "positive prompts file, one prompt per line (default: '%s')", params.cvector_positive_file.c_str() });
options.push_back({ "cvector", "--negative-file FNAME", "negative prompts file, one prompt per line (default: '%s')", params.cvector_negative_file.c_str() });
options.push_back({ "cvector", "--completions-file FNAME","completions file (default: '%s')", params.cvector_completions_file.c_str() });
options.push_back({ "cvector", "--completions N", "number of lines of completions file to use (default: %d)", params.n_completions });
options.push_back({ "cvector", "--batch-pca N", "batch size used for PCA. Larger batch runs faster, but uses more memory (default: %d)", params.n_pca_batch });
options.push_back({ "cvector", "--iter-pca N", "number of iterations used for PCA (default: %d)", params.n_pca_iterations });

@ggerganov (Owner):
The whitespace padding should be kept so that the arguments are vertically aligned when the help is printed:

Suggested change (original lines first, then the whitespace-aligned replacement):
options.push_back({ "control-vector" });
options.push_back({ "cvector", "-o, --output FNAME", "output file (default: '%s')", params.cvector_outfile.c_str() });
options.push_back({ "cvector", "--positive-file FNAME", "positive prompts file, one prompt per line (default: '%s')", params.cvector_positive_file.c_str() });
options.push_back({ "cvector", "--negative-file FNAME", "negative prompts file, one prompt per line (default: '%s')", params.cvector_negative_file.c_str() });
options.push_back({ "cvector", "--completions-file FNAME","completions file (default: '%s')", params.cvector_completions_file.c_str() });
options.push_back({ "cvector", "--completions N", "number of lines of completions file to use (default: %d)", params.n_completions });
options.push_back({ "cvector", "--batch-pca N", "batch size used for PCA. Larger batch runs faster, but uses more memory (default: %d)", params.n_pca_batch });
options.push_back({ "cvector", "--iter-pca N", "number of iterations used for PCA (default: %d)", params.n_pca_iterations });
options.push_back({ "control-vector" });
options.push_back({ "cvector", "-o, --output FNAME", "output file (default: '%s')", params.cvector_outfile.c_str() });
options.push_back({ "cvector", " --positive-file FNAME", "positive prompts file, one prompt per line (default: '%s')", params.cvector_positive_file.c_str() });
options.push_back({ "cvector", " --negative-file FNAME", "negative prompts file, one prompt per line (default: '%s')", params.cvector_negative_file.c_str() });
options.push_back({ "cvector", " --completions-file FNAME",
"completions file (default: '%s')", params.cvector_completions_file.c_str() });
options.push_back({ "cvector", " --completions N", "number of lines from the completions file to use (default: %d)", params.n_completions });
options.push_back({ "cvector", " --batch-pca N", "batch size used for PCA. Larger batch runs faster, but uses more memory (default: %d)", params.n_pca_batch });
options.push_back({ "cvector", " --iter-pca N", "number of iterations used for PCA (default: %d)", params.n_pca_iterations });

@ngxson (Collaborator, Author):
FYI, I also changed the example name + binary name to llama-cvector-generator


```
<|im_start|>system\nAct like a person who is extremely happy.<|im_end|>
<|im_start|>system\nYou are in a very good mood today<|im_end|>
```
@ngxson (Collaborator, Author):
@calvin-laurenson I ended up enabling newline escaping by default, which should be more convenient for most users.

@ngxson changed the title from "Add control-vector-generator example" to "Add cvector-generator example" Jun 13, 2024
@ngxson added the "merge ready" label (may be ready to merge soon, holding out in case of objections) Jun 14, 2024
@ngxson ngxson merged commit 0c7b359 into ggerganov:master Jun 15, 2024
66 checks passed
@person4268 commented Jun 16, 2024

FYI, the help text refers to --iter-pca, but the code is looking for --pca-iter. The same applies to --pca-batch/--batch-pca.

Also, if the completion portion bails out because the number of positive prompts != the number of negative prompts, PCA still tries to run (full log below; a sketch of a possible guard follows it):

Log
1 person4268@person4269 ~/source/llama.cpp/build/bin (git)-[master] % ./llama-cvector-generator -m /mnt4/models/L3-70B-Euryale-v2.1-IQ4_XS.gguf -ngl 19 -c 8192 --log-format text -fa --no-mmap --output /mnt4/models/h_sad_eur_cvec.gguf --pca-iter 2000 --pca-batch 100 --completions-file /mnt4/models/cvecs/completions.txt --positive-file /mnt4/models/cvecs/positive.txt --negative-file /mnt4/models/cvecs/negative.txt
main: build = 3153 (0c7b3595)
main: built with clang version 17.0.0 for x86_64-pc-linux-gnu
llama_model_loader: loaded meta data with 27 key-value pairs and 723 tensors from /mnt4/models/L3-70B-Euryale-v2.1-IQ4_XS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = L3-70B-Euryale-v2.1
llama_model_loader: - kv   2:                          llama.block_count u32              = 80
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 30
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 128001
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - kv  23:                      quantize.imatrix.file str              = /models/L3-70B-Euryale-v2.1-GGUF/L3-7...
llama_model_loader: - kv  24:                   quantize.imatrix.dataset str              = /training_data/calibration_datav3.txt
llama_model_loader: - kv  25:             quantize.imatrix.entries_count i32              = 560
llama_model_loader: - kv  26:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q5_K:   80 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_xs:  481 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = IQ4_XS - 4.25 bpw
llm_load_print_meta: model params     = 70.55 B
llm_load_print_meta: model size       = 35.29 GiB (4.30 BPW) 
llm_load_print_meta: general.name     = L3-70B-Euryale-v2.1
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: PAD token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6700 XT, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size =    0.74 MiB
llm_load_tensors: offloading 19 repeating layers to GPU
llm_load_tensors: offloaded 19/81 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  8261.44 MiB
llm_load_tensors:  ROCm_Host buffer size = 27877.86 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   608.00 MiB
llama_kv_cache_init:  ROCm_Host KV buffer size =  1952.00 MiB
llama_new_context_with_model: KV self size  = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =  1088.45 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    32.01 MiB
llama_new_context_with_model: graph nodes  = 2247
llama_new_context_with_model: graph splits = 675
number of positive and negative prompts must be equal
n_total_tokens: 0
Done evaluate prompts, unload model...
build_v_diff
print_debug_tensor: diff_0 (f32): [0, 8192]
print_debug_tensor: diff_0[0] = [ 0.000000, 0.000000, 0.000000, 0.000000, 245415337263104.000000, 13128352454087278592.000000, ... ]
print_debug_tensor: diff_1 (f32): [0, 8192]
print_debug_tensor: diff_1[0] = [ 0.000000, 0.000000, 0.000000, 0.000000, -16349680.000000, 0.000000, ... ]
print_debug_tensor: diff_2 (f32): [0, 8192]
print_debug_tensor: diff_2[0] = [ 0.000000, 0.000000, 0.000000, 0.000000, 74943081985119092736.000000, -7654367232.000000, ... ]
print_debug_tensor: diff_3 (f32): [0, 8192]
print_debug_tensor: diff_3[0] = [ 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_4 (f32): [0, 8192]
print_debug_tensor: diff_4[0] = [ 0.000000, 0.000000, 0.000000, 0.000000, -16349680.000000, 0.000000, ... ]
print_debug_tensor: diff_5 (f32): [0, 8192]
print_debug_tensor: diff_5[0] = [ 0.000000, 0.000000, 0.000000, 0.000000, 273573288019132153856.000000, 0.000000, ... ]
print_debug_tensor: diff_6 (f32): [0, 8192]
print_debug_tensor: diff_6[0] = [ -21950031649026397917780377600.000000, 0.000000, 0.000000, 0.000000, 273573288019132153856.000000, 0.000000, ... ]
print_debug_tensor: diff_7 (f32): [0, 8192]
print_debug_tensor: diff_7[0] = [ -21745479983637657800526528512.000000, 0.000000, 287592923290392330240.000000, 879223082556685096676811171954688.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_8 (f32): [0, 8192]
print_debug_tensor: diff_8[0] = [ -137126871040.000000, 0.000000, -137126871040.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_9 (f32): [0, 8192]
print_debug_tensor: diff_9[0] = [ -137126739968.000000, 0.000000, -80122.875000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_10 (f32): [0, 8192]
print_debug_tensor: diff_10[0] = [ 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_11 (f32): [0, 8192]
print_debug_tensor: diff_11[0] = [ 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_12 (f32): [0, 8192]
print_debug_tensor: diff_12[0] = [ 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_13 (f32): [0, 8192]
print_debug_tensor: diff_13[0] = [ 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_14 (f32): [0, 8192]
print_debug_tensor: diff_14[0] = [ 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_15 (f32): [0, 8192]
print_debug_tensor: diff_15[0] = [ 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_16 (f32): [0, 8192]
print_debug_tensor: diff_16[0] = [ -21957186034247945430279127040.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_17 (f32): [0, 8192]
print_debug_tensor: diff_17[0] = [ 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_18 (f32): [0, 8192]
print_debug_tensor: diff_18[0] = [ 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_19 (f32): [0, 8192]
print_debug_tensor: diff_19[0] = [ 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_20 (f32): [0, 8192]
print_debug_tensor: diff_20[0] = [ -21737560575045885405503160320.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_21 (f32): [0, 8192]
print_debug_tensor: diff_21[0] = [ -137127002112.000000, 0.000000, -8082344.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_22 (f32): [0, 8192]
print_debug_tensor: diff_22[0] = [ -137127002112.000000, 0.000000, -10023872.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_23 (f32): [0, 8192]
print_debug_tensor: diff_23[0] = [ -137127002112.000000, 0.000000, -79088.250000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_24 (f32): [0, 8192]
print_debug_tensor: diff_24[0] = [ -137127002112.000000, 0.000000, -78330.750000, 0.000000, -0.000000, -0.000000, ... ]
print_debug_tensor: diff_25 (f32): [0, 8192]
print_debug_tensor: diff_25[0] = [ -137127002112.000000, 0.000000, -78644.500000, 0.000000, 73604148164365473808384.000000, 73772818640122563267649339392.000000, ... ]
print_debug_tensor: diff_26 (f32): [0, 8192]
print_debug_tensor: diff_26[0] = [ -137127002112.000000, 0.000000, -813495668042629120.000000, 0.000000, 17256819553412347886305280.000000, 70799560714130813738592684212224.000000, ... ]
print_debug_tensor: diff_27 (f32): [0, 8192]
print_debug_tensor: diff_27[0] = [ -137127002112.000000, 0.000000, -137127002112.000000, 0.000000, -813519857298440192.000000, 0.000000, ... ]
print_debug_tensor: diff_28 (f32): [0, 8192]
print_debug_tensor: diff_28[0] = [ -137127133184.000000, 0.000000, -1052054506897932288.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_29 (f32): [0, 8192]
print_debug_tensor: diff_29[0] = [ -21947783802580551966658658304.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_30 (f32): [0, 8192]
print_debug_tensor: diff_30[0] = [ -137127133184.000000, 0.000000, -984517005161791488.000000, 0.000000, -1052057805432815616.000000, 0.000000, ... ]
print_debug_tensor: diff_31 (f32): [0, 8192]
print_debug_tensor: diff_31[0] = [ -21948409516139532194649473024.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_32 (f32): [0, 8192]
print_debug_tensor: diff_32[0] = [ -137127133184.000000, 0.000000, -10005248.000000, 0.000000, -3773708624480698368.000000, 0.000000, ... ]
print_debug_tensor: diff_33 (f32): [0, 8192]
print_debug_tensor: diff_33[0] = [ -21947842832161587837223829504.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_34 (f32): [0, 8192]
print_debug_tensor: diff_34[0] = [ -137127133184.000000, 0.000000, -80188.750000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_35 (f32): [0, 8192]
print_debug_tensor: diff_35[0] = [ -21771266465817367498215915520.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_36 (f32): [0, 8192]
print_debug_tensor: diff_36[0] = [ -137127133184.000000, 0.000000, -137127133184.000000, 0.000000, -0.000000, -0.000000, ... ]
print_debug_tensor: diff_37 (f32): [0, 8192]
print_debug_tensor: diff_37[0] = [ -21737565297412368275148374016.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_38 (f32): [0, 8192]
print_debug_tensor: diff_38[0] = [ -137127264256.000000, 0.000000, -6585195041076019200.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_39 (f32): [0, 8192]
print_debug_tensor: diff_39[0] = [ -137127002112.000000, 0.000000, -137127002112.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_40 (f32): [0, 8192]
print_debug_tensor: diff_40[0] = [ -137127264256.000000, 0.000000, -4103188278860054528.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_41 (f32): [0, 8192]
print_debug_tensor: diff_41[0] = [ -137127002112.000000, 0.000000, -137127002112.000000, 0.000000, -14249872.000000, 0.000000, ... ]
print_debug_tensor: diff_42 (f32): [0, 8192]
print_debug_tensor: diff_42[0] = [ -137127264256.000000, 0.000000, -4103104715976343552.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_43 (f32): [0, 8192]
print_debug_tensor: diff_43[0] = [ -137127002112.000000, 0.000000, -137127002112.000000, 0.000000, -14249872.000000, 0.000000, ... ]
print_debug_tensor: diff_44 (f32): [0, 8192]
print_debug_tensor: diff_44[0] = [ -137127264256.000000, 0.000000, -4103021153092632576.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_45 (f32): [0, 8192]
print_debug_tensor: diff_45[0] = [ -137127002112.000000, 0.000000, -137127002112.000000, 0.000000, -14249872.000000, 0.000000, ... ]
print_debug_tensor: diff_46 (f32): [0, 8192]
print_debug_tensor: diff_46[0] = [ -137127264256.000000, 0.000000, -4102871619511255040.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_47 (f32): [0, 8192]
print_debug_tensor: diff_47[0] = [ -137127002112.000000, 0.000000, -137127002112.000000, 0.000000, -14249872.000000, 0.000000, ... ]
print_debug_tensor: diff_48 (f32): [0, 8192]
print_debug_tensor: diff_48[0] = [ -137127264256.000000, 0.000000, -79094.125000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_49 (f32): [0, 8192]
print_debug_tensor: diff_49[0] = [ -137127002112.000000, 0.000000, -137127002112.000000, 0.000000, -14249872.000000, 0.000000, ... ]
print_debug_tensor: diff_50 (f32): [0, 8192]
print_debug_tensor: diff_50[0] = [ -137127264256.000000, 0.000000, -79077.000000, 0.000000, -0.000000, -0.000000, ... ]
print_debug_tensor: diff_51 (f32): [0, 8192]
print_debug_tensor: diff_51[0] = [ -137127002112.000000, 0.000000, -137127002112.000000, 0.000000, -0.000000, 0.000000, ... ]
print_debug_tensor: diff_52 (f32): [0, 8192]
print_debug_tensor: diff_52[0] = [ -137127264256.000000, 0.000000, -79081.375000, 0.000000, -0.000000, -0.000000, ... ]
print_debug_tensor: diff_53 (f32): [0, 8192]
print_debug_tensor: diff_53[0] = [ -137127002112.000000, 0.000000, -137127002112.000000, 0.000000, -0.000000, -0.000000, ... ]
print_debug_tensor: diff_54 (f32): [0, 8192]
print_debug_tensor: diff_54[0] = [ -137127264256.000000, 0.000000, -79078.750000, 0.000000, -0.000000, -0.000000, ... ]
print_debug_tensor: diff_55 (f32): [0, 8192]
print_debug_tensor: diff_55[0] = [ -137127002112.000000, 0.000000, -137127002112.000000, 0.000000, -0.000000, -0.000000, ... ]
print_debug_tensor: diff_56 (f32): [0, 8192]
print_debug_tensor: diff_56[0] = [ -137127264256.000000, 0.000000, -79077.875000, 0.000000, -0.000000, -0.000000, ... ]
print_debug_tensor: diff_57 (f32): [0, 8192]
print_debug_tensor: diff_57[0] = [ -137127002112.000000, 0.000000, -137127002112.000000, 0.000000, -0.000000, -0.000000, ... ]
print_debug_tensor: diff_58 (f32): [0, 8192]
print_debug_tensor: diff_58[0] = [ -137127264256.000000, 0.000000, -137127264256.000000, 0.000000, -0.000000, -0.000000, ... ]
print_debug_tensor: diff_59 (f32): [0, 8192]
print_debug_tensor: diff_59[0] = [ -137127002112.000000, 0.000000, -137127002112.000000, 0.000000, -0.000000, -0.000000, ... ]
print_debug_tensor: diff_60 (f32): [0, 8192]
print_debug_tensor: diff_60[0] = [ -137127395328.000000, 0.000000, -3773792187364409344.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_61 (f32): [0, 8192]
print_debug_tensor: diff_61[0] = [ -137127133184.000000, 0.000000, -137127133184.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_62 (f32): [0, 8192]
print_debug_tensor: diff_62[0] = [ -21957169505965255386520879104.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_63 (f32): [0, 8192]
print_debug_tensor: diff_63[0] = [ -137127395328.000000, 0.000000, -1338107850026647552.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_64 (f32): [0, 8192]
print_debug_tensor: diff_64[0] = [ -137127133184.000000, 0.000000, -137127133184.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_65 (f32): [0, 8192]
print_debug_tensor: diff_65[0] = [ -21957169505965255386520879104.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_66 (f32): [0, 8192]
print_debug_tensor: diff_66[0] = [ -137127395328.000000, 0.000000, -1298677164031344640.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_67 (f32): [0, 8192]
print_debug_tensor: diff_67[0] = [ -137127133184.000000, 0.000000, -137127133184.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_68 (f32): [0, 8192]
print_debug_tensor: diff_68[0] = [ -21950031649026397917780377600.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_69 (f32): [0, 8192]
print_debug_tensor: diff_69[0] = [ -137127395328.000000, 0.000000, -1298652974775533568.000000, 0.000000, -1549916962816.000000, 0.000000, ... ]
print_debug_tensor: diff_70 (f32): [0, 8192]
print_debug_tensor: diff_70[0] = [ -137127133184.000000, 0.000000, -137127133184.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_71 (f32): [0, 8192]
print_debug_tensor: diff_71[0] = [ -21949866366199497480197898240.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_72 (f32): [0, 8192]
print_debug_tensor: diff_72[0] = [ -137127395328.000000, 0.000000, -926125241145491456.000000, 0.000000, -1549916962816.000000, 0.000000, ... ]
print_debug_tensor: diff_73 (f32): [0, 8192]
print_debug_tensor: diff_73[0] = [ -137127133184.000000, 0.000000, -137127133184.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_74 (f32): [0, 8192]
print_debug_tensor: diff_74[0] = [ -21949866366199497480197898240.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_75 (f32): [0, 8192]
print_debug_tensor: diff_75[0] = [ -137127395328.000000, 0.000000, -926113146517585920.000000, 0.000000, -1549916962816.000000, 0.000000, ... ]
print_debug_tensor: diff_76 (f32): [0, 8192]
print_debug_tensor: diff_76[0] = [ -137127133184.000000, 0.000000, -137127133184.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_77 (f32): [0, 8192]
print_debug_tensor: diff_77[0] = [ -21947354067230610828944211968.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, ... ]
print_debug_tensor: diff_78 (f32): [0, 8192]
print_debug_tensor: diff_78[0] = [ -137127395328.000000, 0.000000, -1125619531377541120.000000, 0.000000, -1549916962816.000000, 0.000000, ... ]
run_pca: Running PCA...
GGML_ASSERT: /home/michael/source/llama.cpp/ggml.c:5284: !ggml_is_transposed(a)
ptrace: Operation not permitted.
No stack.
The program is not being run.
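For what it's worth, a guard along these lines (a hypothetical helper, not the PR's actual code) would make the run abort before PCA when the prompt counts don't match:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Abort early instead of continuing into PCA with empty/garbage diff tensors
// when the positive and negative prompt counts differ.
static bool check_prompt_counts(const std::vector<std::string> & positive,
                                const std::vector<std::string> & negative) {
    if (positive.size() != negative.size()) {
        fprintf(stderr, "number of positive and negative prompts must be equal\n");
        return false;
    }
    return true;
}
```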

Labels: examples, merge ready, need feedback, Review Complexity : High
Projects: None yet
Development: successfully merging this pull request may close the issue "Generate control vector using llama.cpp"
8 participants