embeddings #16
Hi, great project!

How hard would it be to extract embeddings from the LLMs?

Comments
```swift
let text = "hello world"
let llm = LLM(...)
let embeddings = llm.encode(text)
```

and for decoding:

```swift
let decodedText = llm.model.decode(embeddings)
```
Thanks, I thought about vector embeddings though: [Token] would just be ≈ one int per word.
I think you have the wrong definition of LLM embeddings, and it's understandable because I was also once confused about the concept. You might want to check this comment; it's also the reason why I chose not to use the word "embedding" in this library. If you want to test the similarities between embeddings, you can cast the [Token] array of ints to an array of floats. However, for checking similarity between simple words like "king" or "queen" in your example, I suggest you just use Apple's NaturalLanguage framework. Although I haven't tested this myself, for just checking similarities between sentences or words you could use the similarity-search-kit library, or use it together with this.
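A minimal sketch of that word-level route, assuming the Apple option referred to above is the NaturalLanguage framework's NLEmbedding (which measures cosine distance by default):

```swift
import NaturalLanguage

// Built-in English word embeddings; returns nil if unavailable on the platform.
if let embedding = NLEmbedding.wordEmbedding(for: .english) {
    // Cosine distance by default: smaller means more similar.
    let distance = embedding.distance(between: "king", and: "queen")
    print("distance(king, queen) =", distance)
}
```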
Thanks, very helpful links! LLM embeddings are usually a vector of floats and Token encodings are a vector of ints; casting these to float makes no sense. No confusion here ;)
I was just saying that you have the option to cast an array of ints to an array of floats so that you can check cosine similarity; after all, an int array is also a valid one-dimensional vector. I'm glad I was able to help you!
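For completeness, a minimal cosine-similarity sketch in Swift (the helper `cosineSimilarity` is illustrative, not part of this library); whether the input vectors are semantically meaningful is a separate question, as the comments above note:

```swift
// Cosine similarity between two equal-length float vectors.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have the same length")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (normA.squareRoot() * normB.squareRoot())
}

// Casting a [Token] array of ints to floats, as described above:
let tokens: [Int32] = [15339, 1917] // hypothetical token IDs
let vector = tokens.map { Float($0) }
```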
So, I researched this a bit further, since I wasn't sure I had understood the concept correctly. What you are referring to is not related to LLMs, indeed. However, embedding models are usually used alongside LLMs, typically for text search, and that's where the confusion occurs; aside from the fact that some people refer to tokens as embeddings, that is. For example, Mistral uses
word2vec embeddings were not related to LLMs, but today embeddings are (mostly) produced via LLMs or, as you correctly pointed out, SLMs (small language models), although some believe that truly large LLMs also give better embeddings.
I think your research yielded a wrong result there. While using all current activations as the embedding would be overkill, LLM embeddings are indeed calculated from activations:

• Pooling strategies: applying operations such as mean or max pooling over activations from one or more layers to create fixed-size embeddings.
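A minimal sketch of mean pooling in Swift, assuming `hiddenStates` already holds one last-layer activation vector per token (the function and parameter names are illustrative, not an existing API of this library):

```swift
// Mean pooling: average the per-token activation vectors of one layer
// into a single fixed-size embedding.
func meanPooledEmbedding(_ hiddenStates: [[Float]]) -> [Float] {
    guard let hiddenSize = hiddenStates.first?.count else { return [] }
    var pooled = [Float](repeating: 0, count: hiddenSize)
    for tokenActivations in hiddenStates {
        for i in 0..<hiddenSize {
            pooled[i] += tokenActivations[i]
        }
    }
    let tokenCount = Float(hiddenStates.count)
    return pooled.map { $0 / tokenCount }
}
```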
Thank you for the clarification and the clear explanation. I was the one who had the wrong idea, my bad. I'll look into this more, find a way to get embeddings through the methods you described, and keep you updated here. I really appreciate it; it's hard to get the right information in the LLM field as a non-researcher, and I have to learn more about this. I'll see if I can implement this in my library, referencing this code: until then, in