Improve support for special tokens #1931
Conversation
A breaking change to the GGML format might be a tough sell (but don't take my personal opinion as speaking for the project in any way). You might consider a command-line option and/or API addition that just reads the special tokens from a separate file or list.
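For illustration only, a minimal sketch of what reading special tokens from a separate file could look like; the flag name and helper below are entirely hypothetical and not an existing llama.cpp option or API:

```cpp
// Hypothetical sketch: load special tokens from a plain text file, one token
// per line, e.g. to back a made-up "--special-tokens-file PATH" option.
// Neither the flag nor this helper exists in llama.cpp.
#include <fstream>
#include <string>
#include <vector>

static std::vector<std::string> load_special_tokens(const std::string & path) {
    std::vector<std::string> tokens;
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line)) {
        if (!line.empty()) {
            tokens.push_back(line); // e.g. "<|user|>", "<|assistant|>"
        }
    }
    return tokens;
}
```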
llama.cpp is under the MIT license and transformers seems to be under Apache 2.0. I'm not qualified to say what the issue is or how to fix it, but the projects do have different licenses. I don't know if there's any kind of policy already in place for dealing with that situation. Perhaps someone else can give you a better answer, but until that part is resolved I'd guess you shouldn't expect your PR to get merged (again, not speaking with any kind of authority).
On the bright side, the old format is still supported... I agree with you, and it's also something I would want to avoid, but after looking at the code it seems that everything related to the vocab is integrated inside the ggml (e.g. the file
Yes, this is something that concerns me a bit. I would appreciate feedback on whether I should be concerned about the license. After all, I'm not simply copy-pasting the code. Perhaps the comment I added with the link to the original implementation would be sufficient?
You're absolutely right, no doubt about that; that's why I brought it up in the PR message. This is quite important for an open-source project, so it needs to be sorted out before the PR is merged.
AFAIK, there is a discussion on a new format here: ggerganov/ggml#220
You may want to chime in.
I don't believe I have anything to contribute to the discussion... but upon a closer look, it appears there is also a discussion regarding the tokenizer in that issue 🤔
Hey @Igoorx, what's the status of this PR? I'm really interested in this work. I fine-tune using LoRA and add a few special tokens, but then these aren't tokenized correctly when running inference on llama.cpp. I'm going to try your PR and see if it helps things.
Thanks. I've started taking a stab at the merge conflicts, but I imagine I'll get a few things wrong and it'll end up consuming a lot of my time 😅 I read about GGUF, but I've got some deadlines coming up and I'd like a working solution sooner rather than later, and it's not clear what kind of timeline to expect for GGUF.
On Mon, Aug 7, 2023 at 14:09 Igor Pissolati wrote:

> @grantbey It is finished, but since the maintainers had no interest whatsoever in merging it I didn't fix the merge conflicts. If you can't do that by yourself, you should just wait for GGUF, since it's right around the corner: #2398
@grantbey I rebased the PR onto the latest master commit 👍
Ok wow @Igoorx you're amazing. Thank you so much!
@goerch Yeah, that's fine. You just have to be aware of this: Lines 554 to 555 in 6f7daba
The trie algorithm used in the PR is a port from the huggingface repository, as noted in the comment, so maybe something needs to be done about that. I'm not sure if that comment is enough or if it would be necessary to add the HF license somewhere.
The gguf gpt2 tokenizer also has a Trie implementation. That tokenizer is under the MIT license. Maybe it could be reused for the llama tokenizer.
IANAL, but I'm equally concerned about compatibility with the license. @ggerganov and others: any opinions on this?
The author of the gpt2 tokenizer gave permission to use it and stated that it is under the MIT license here: #2398 (comment)
The important part is Lines 575 to 577 in 6f7daba. If using the ported version isn't an option, it would be necessary to reimplement it using the other trie algorithm.
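For context, the kind of structure being discussed is roughly the following: a character trie built from the special-token strings, used to find the longest special token that starts at a given position. This is a from-scratch illustrative sketch (the type and method names are made up), not the ported huggingface code or the gpt2 tokenizer's implementation:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Illustrative only: a tiny character trie for finding the longest
// special-token string that starts at a given position in the text.
struct special_token_trie {
    struct node {
        std::map<char, size_t> children; // char -> index into `nodes`
        bool is_token_end = false;
    };

    std::vector<node> nodes = { node{} }; // nodes[0] is the root

    void add(const std::string & token) {
        size_t cur = 0;
        for (const char c : token) {
            auto it = nodes[cur].children.find(c);
            if (it == nodes[cur].children.end()) {
                nodes.push_back(node{});
                nodes[cur].children[c] = nodes.size() - 1;
                cur = nodes.size() - 1;
            } else {
                cur = it->second;
            }
        }
        nodes[cur].is_token_end = true;
    }

    // Length of the longest special token starting at text[pos], or 0 if none.
    size_t longest_match(const std::string & text, size_t pos) const {
        size_t cur = 0;
        size_t best = 0;
        for (size_t i = pos; i < text.size(); ++i) {
            const auto it = nodes[cur].children.find(text[i]);
            if (it == nodes[cur].children.end()) {
                break;
            }
            cur = it->second;
            if (nodes[cur].is_token_end) {
                best = i - pos + 1;
            }
        }
        return best;
    }
};
```

Splitting a prompt then amounts to walking it once and calling `longest_match` at each position.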
IANAL either, but I think Apache-2.0 and MIT should be compatible with each other: https://law.stackexchange.com/a/6732
It looks like the MIT and Apache licenses are compatible, but a copy of the Apache license and a NOTICE file must be included:
I'd recommend either implementing a trie from scratch or using a linear search algorithm - we are not tokenizing billions of tokens, so I'm not sure what we gain from using a trie.
If you say so, then the trie probably really is a premature optimization in this case... I changed the code to use linear search.
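As a rough illustration of that approach (a sketch only, not the PR's actual code; the function name and the (segment, is_special) representation are made up): repeatedly find the earliest occurrence of any special-token string, emit the plain text before it for normal tokenization, and emit the matched special token as its own segment.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Illustrative sketch: split `text` into (segment, is_special) pieces using a
// plain linear search over the special-token strings. Plain segments would go
// through normal tokenization; special segments map directly to their token ids.
static std::vector<std::pair<std::string, bool>> split_on_special_tokens(
        const std::string & text,
        const std::vector<std::string> & specials) {
    std::vector<std::pair<std::string, bool>> out;
    size_t pos = 0;
    while (pos < text.size()) {
        size_t best_pos = std::string::npos;
        size_t best_len = 0;
        // Earliest occurrence of any special token; on ties, prefer the longest.
        for (const std::string & st : specials) {
            if (st.empty()) {
                continue;
            }
            const size_t found = text.find(st, pos);
            if (found == std::string::npos) {
                continue;
            }
            if (found < best_pos || (found == best_pos && st.size() > best_len)) {
                best_pos = found;
                best_len = st.size();
            }
        }
        if (best_pos == std::string::npos) {
            out.emplace_back(text.substr(pos), false); // no more special tokens
            break;
        }
        if (best_pos > pos) {
            out.emplace_back(text.substr(pos, best_pos - pos), false);
        }
        out.emplace_back(text.substr(best_pos, best_len), true);
        pos = best_pos + best_len;
    }
    return out;
}
```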
Currently discussing this at #2820. Maybe close this one and participate over there?
* Rewrite special token handling from #1931
* shorten param name, add st verification by type
* use offsets instead of copy by substr
* formatting, remove copying iterator on delete
* llama : normalize code-style
* swift fix
* print pfx/sfx if verb, main: split pfx input sfx
* dont add space when using special tokens
* minor : comment + spacing

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Fixes #1812
Fixes #1501
Hello! This is my first attempt to contribute to this project, so I apologize in advance for any mistakes.
This PR should add basic support for special tokens and improve the support for added tokens. All special tokens come from the file `added_tokens.json` and the file `special_tokens_map.json` or the file `tokenizer_config.json` (I have no idea if it's safe to rely on only one, so I added `tokenizer_config.json` as a fallback).

EDIT: The loading of the JSON files now seems to follow the huggingface implementation a bit better.
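To make the fallback order concrete, here is an illustrative sketch (the helper name is made up, and nlohmann/json is used purely for illustration; in the PR these files are read during model conversion): prefer `special_tokens_map.json`, fall back to `tokenizer_config.json`, and also collect the entries of `added_tokens.json`.

```cpp
#include <fstream>
#include <set>
#include <string>

#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Illustrative only: gather special-token strings with the fallback order
// described above. This is not the PR's actual code.
static std::set<std::string> collect_special_tokens(const std::string & model_dir) {
    std::set<std::string> specials;

    // Prefer special_tokens_map.json; fall back to tokenizer_config.json.
    for (const char * name : { "special_tokens_map.json", "tokenizer_config.json" }) {
        std::ifstream f(model_dir + "/" + name);
        if (!f) {
            continue;
        }
        const json j = json::parse(f);
        for (const char * key : { "bos_token", "eos_token", "unk_token", "pad_token" }) {
            // Some models store these entries as objects with a "content" field;
            // for brevity only the plain-string form is handled here.
            if (j.contains(key) && j[key].is_string()) {
                specials.insert(j[key].get<std::string>());
            }
        }
        if (j.contains("additional_special_tokens") && j["additional_special_tokens"].is_array()) {
            for (const auto & t : j["additional_special_tokens"]) {
                if (t.is_string()) {
                    specials.insert(t.get<std::string>());
                }
            }
        }
        break; // the first file that exists wins
    }

    // added_tokens.json maps added-token strings to their ids.
    std::ifstream f(model_dir + "/added_tokens.json");
    if (f) {
        const json added = json::parse(f);
        for (const auto & item : added.items()) {
            specials.insert(item.key());
        }
    }

    return specials;
}
```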
The most important points of this PR are:

* The GGML format was changed because a way to know which tokens are the special tokens was required. EDIT: This isn't necessary anymore.
* The tokenizer now uses a trie algorithm to efficiently split the prompt based on the special tokens; this was necessary because the BPE tokenizer isn't able to tokenize the special tokens by itself. Please note that this algorithm was ported from the huggingface/transformers repository, so I wonder if this could cause license issues? EDIT: The algorithm is now just a linear search.
Using this PR, this is the output of `--verbose-prompt` when the model is converted to GGML with this `special_tokens_map.json`: