Deepseek coder merge #5464
Conversation
@@ -211,6 +213,59 @@ def from_model_architecture(model_architecture):
            return MiniCPMModel
        if model_architecture == "BertModel":
            return BertModel

    @staticmethod
    def from_model_name(model_name: str):
Was this ever used? And why is this function duplicated below?
I'm not sure. It looks like a convenience function for name-based mapping, in case someone needs it in future convert-hf-to-gguf work. (The duplication is likely my fault, of course.)
Should I get rid of it, or comment it out? (Remember, I'm just merging their contributed deepseek/HF tokenizer code over mostly blindly -- although it does work. It resolves that out-of-range error too.) @ggerganov
Let me know -- I'll also have to rebase and resubmit.
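For readers following along, here is a purely illustrative sketch of what such a name-to-class mapping helper could look like. The body of from_model_name is not shown in the hunk above, so the mapping and the stub classes below are placeholders, not the PR's actual code:

# Illustrative sketch only -- not the code from this PR.
# A name-based lookup analogous to from_model_architecture, keyed on a
# human-readable model name instead of the HF architecture string.
class Model: ...
class DeepseekCoderModel(Model): ...
class DeepseekLLMModel(Model): ...

def from_model_name(model_name: str):
    name_map = {
        "deepseekcoder": DeepseekCoderModel,
        "deepseekllm": DeepseekLLMModel,
    }
    # Fall back to the generic converter when the name is not recognized.
    return name_map.get(model_name.lower().replace("-", ""), Model)

print(from_model_name("deepseek-coder"))  # <class '__main__.DeepseekCoderModel'>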
I just checked - it looks like ggerganov accidentally dropped this in d24da31 (#4070). It's apparently used to force which model class is used via the command line? It will really be out of place after #5825; it should probably just be removed.
I see now - DeepseekCoderModel and DeepseekLLMModel can't be disambiguated from the model architecture alone. This should be changed so that they use a single class that derives tokenizer_model from either the model's config, or the command-line arguments if it really is a user choice.
It's honestly not clear to me why LlamaForCausalLM is referenced at all in convert-hf-to-gguf.py - convert.py is already capable of dealing with a llama model with a non-SPM tokenizer, and has superior memory management (so it's faster).
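To make the suggested direction concrete, here is a minimal sketch of a helper that derives the tokenizer choice from the checkpoint's own files rather than from the architecture string. The file names (config.json, tokenizer_config.json) are standard Hugging Face conventions, but the decision rule below is an assumption for illustration, not the PR's actual logic:

import json
from pathlib import Path

def detect_tokenizer_model(model_dir: Path) -> str:
    # Illustrative only: the exact fields that distinguish the two DeepSeek
    # variants are assumed here, not taken from the PR.
    cfg = json.loads((model_dir / "config.json").read_text(encoding="utf-8"))
    tok_cfg_path = model_dir / "tokenizer_config.json"
    tok_cfg = json.loads(tok_cfg_path.read_text(encoding="utf-8")) if tok_cfg_path.exists() else {}

    # Hypothetical rule: use a name hint from the model's own metadata when
    # the architecture string ("LlamaForCausalLM") cannot disambiguate.
    name_hint = (cfg.get("_name_or_path", "") + " " + tok_cfg.get("tokenizer_class", "")).lower()
    if "deepseek" in name_hint and "coder" in name_hint:
        return "deepseek_coder"
    if "deepseek" in name_hint:
        return "deepseek_llm"
    return "llama"  # default SPM-style handling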
} else if (tokenizer_name == "bert") {
    vocab.type = LLAMA_VOCAB_TYPE_WPM;

    // default special tokens (standard BERT vocab ids: [UNK]=100, [CLS]=101, [SEP]=102)
    vocab.special_bos_id = 101; // [CLS]
    vocab.special_eos_id = 102; // [SEP]
    vocab.special_unk_id = 100; // [UNK]
    vocab.special_sep_id = -1;  // not set
    vocab.special_pad_id = -1;  // not set
    vocab.add_space_prefix = false;
tabs -> spaces
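For context, the defaults in the hunk above line up with the special-token ids of the standard BERT WordPiece vocabulary. A quick way to check, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint:

# Verify the special-token ids of the standard BERT vocabulary.
# Requires: pip install transformers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.cls_token_id)  # 101 -> used above as special_bos_id
print(tok.sep_token_id)  # 102 -> used above as special_eos_id
print(tok.unk_token_id)  # 100 -> used above as special_unk_id
print(tok.pad_token_id)  # 0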
Looking further into it, the whitespace and the extra function are minor issues compared to the rewriting of the tokenizer (e.g. @ggerganov's point: "This is a big change in the tokenizer and I'm a little worried it might break something.")
It would be great to wrap this up and add Deepseek models. But someone has to carefully look into the regex preprocessing changes.

IIUC this change takes a more general approach for tokenization pre-processing by applying regex with […]. There is also some tokenization work done in #5613 which should be considered.

I think the best thing to do is: […]

Since this might take more time to implement, we can revert […]. Anyone interested in helping out with this?
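For readers unfamiliar with this kind of pre-processing, here is a small, self-contained illustration (not code from this PR) of regex-based BPE pre-tokenization using Unicode categories. The pattern shown is the widely published GPT-2 one; the DeepSeek models use their own, different patterns. The third-party regex module is assumed because Python's built-in re does not support \p{L}/\p{N} classes:

# Illustration only: GPT-2-style BPE pre-tokenization, i.e. splitting text
# into chunks with a regex before byte-pair merges are applied per chunk.
import regex  # pip install regex

GPT2_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

def pre_tokenize(text: str) -> list[str]:
    # Each returned chunk is later BPE-merged independently.
    return regex.findall(GPT2_PATTERN, text)

print(pre_tokenize("def add(a, b): return a+b  # 2 numbers"))
# ['def', ' add', '(', 'a', ',', ' b', '):', ' return', ' a', '+', 'b', ' ', ' #', ' 2', ' numbers']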
I'm really happy to see everyone working hard on implementing the pretokenize mechanism. I apologize for not addressing related issues sooner due to being busy with other matters recently. One issue I'd like to mention is that in my original implementation #4070, I used wcregex to enhance the speed of regex matching. However, the dependency on wchar, which has different default data types for compilers on Unix and Mac/Windows, has remained unresolved. So I think it only works correctly on Unix right now. This is mainly because I lack experience with cross-platform C++ compilation. I'm hoping someone can help out with this.
I have an idea: what if we offer the new tokenizer as a command-line option, so those needing it can be the ones evaluating it -- a way of getting some beta testing in? (If I recall correctly from the original PR discussion, it didn't just affect some types of splitting, but was necessary to avoid some errors?)

(Again I'll disclaim: I was mostly operating in this PR as a "moving man" merging in the aging PR. I don't know much about the furniture.)
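To make the suggestion concrete, here is a hypothetical sketch of such an opt-in switch for the conversion script. The flag name and wiring are invented for illustration and are not part of this PR (and, as the next comment notes, the idea was ultimately not adopted):

import argparse

# Hypothetical illustration only: an opt-in flag for the new pre-tokenizer.
# Neither the flag name nor this wiring exists in convert-hf-to-gguf.py.
parser = argparse.ArgumentParser(description="model conversion (sketch)")
parser.add_argument(
    "--experimental-pretokenizer",
    action="store_true",
    help="use the new regex-based BPE pre-tokenization instead of the old behaviour",
)
args = parser.parse_args()

if args.experimental_pretokenizer:
    print("using the new regex pre-tokenization path")
else:
    print("using the existing tokenizer path")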
Adding an option will become too messy. The old tokenizer is not necessarily wrong; it's just that it implements one of the many different BPE pre-tokenization regexes: https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py

At least this is how I understand it.

Regarding the compatibility with Mac/Windows - I don't know enough, but if it can be fixed by going to the slower […]
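To illustrate why the choice of pre-tokenization regex matters, here is a small comparison. The first pattern is GPT-2's published one; the second is a simplified stand-in (not an exact copy of any tiktoken regex) that, like the newer patterns in the file linked above, limits digit runs to at most three characters:

import regex  # pip install regex

GPT2 = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

# Simplified stand-in for a newer-style pattern -- illustrative only.
NEWER = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

text = "x = 123456"
print(regex.findall(GPT2, text))   # ['x', ' =', ' 123456']
print(regex.findall(NEWER, text))  # ['x', ' =', ' ', '123', '456']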
@DOGEwbx Note: I have no say in this project, so don't take that as me saying it's a way of getting the patch finalized and approved. :}}
Superseded by #6920
...the long, drawn-out PR at:
#4070
"Update gpt2 preprocess and add deepseek coder preprocess"
I went ahead and merged it, fixing their whitespace issue (I think) that was holding up acceptance of the PR, and manually resolving the conflicts resulting from their fork being over 400 commits behind master. I tested it (just running magicoder -- a model needing the deepseek_coder tokenizer) and "it works", but... I hope I did everything right in the merge. :)