
Deepseek coder merge #5464

Closed · wants to merge 2 commits

Conversation

@jaggzh (Contributor) commented Feb 12, 2024

...the long, drawn-out PR at #4070 ("Update gpt2 preprocess and add deepseek coder preprocess").

I went ahead and merged it, fixing their whitespace issue (I think) that was holding up acceptance of the PR, and manually resolving the conflicts resulting from their fork being over 400 commits behind master. I tested it (just running magicoder -- a model needing the deepseek_coder tokenizer) and "it works", but I hope I did everything right in the merge. :)

@@ -211,6 +213,59 @@ def from_model_architecture(model_architecture):
            return MiniCPMModel
        if model_architecture == "BertModel":
            return BertModel

    @staticmethod
    def from_model_name(model_name: str):
Collaborator

Was this ever used? And why is this function duplicated below?

Contributor Author (jaggzh)

> Was this ever used? And why is this function duplicated below?

I'm not sure. It looks like a convenient mapping function if someone needs it in future convert-hf-to-gguf work. (The duplication is likely my fault, of course.)
Should I get rid of it, or comment it out? (Remember, I'm mostly just merging their contributed deepseek/hf tokenizer code over blindly -- although it does work. It resolves that out-of-range error too.) @ggerganov

Let me know -- I'll also have to rebase and resubmit.

@cebtenzzre (Collaborator) commented Mar 1, 2024

I just checked - it looks like ggerganov accidentally dropped this in d24da31 (#4070). It's apparently used for forcing which model class is used via the command line? This will really be out of place after #5825; it should probably just be removed.

@cebtenzzre (Collaborator) commented Mar 1, 2024

I see now - DeepseekCoderModel and DeepseekLLMModel can't be disambiguated from the model architecture alone. This should be changed so that they use a single class that derives tokenizer_model from either the model's config, or the command-line arguments if it really is a user choice.

It's honestly not clear to me why LlamaForCausalLM is referenced at all in convert-hf-to-gguf.py - convert.py is already capable of dealing with a llama model with a non-SPM tokenizer, and has superior memory management (so it's faster).

Comment on lines +3233 to +3242
    } else if (tokenizer_name == "bert") {
        vocab.type = LLAMA_VOCAB_TYPE_WPM;

        // default special tokens
        vocab.special_bos_id = 101;
        vocab.special_eos_id = 102;
        vocab.special_unk_id = 100;
        vocab.special_sep_id = -1;
        vocab.special_pad_id = -1;
        vocab.add_space_prefix = false;
Collaborator

tabs -> spaces

@jaggzh (Contributor, Author) commented Mar 2, 2024

Looking further into it, the whitespace and the extra function are minor issues compared to the rewriting of the tokenizer (e.g. @ggerganov's point: "This is a big change in the tokenizer and I'm a little worried it might break something.").
I'm also not really the person able to assist much in evaluating it.

@ggerganov (Owner)

It would be great to wrap this up and add the Deepseek models, but someone has to carefully look into the regex preprocessing changes.

IIUC, this change takes a more general approach to tokenization pre-processing by applying regexes with std::regex. The regex data is generated via some 3rd-party tool (see #4070). This is in contrast to the old way, where we had a custom implementation of one specific BPE regex. The advantage of the latter is that it is fast; however, it is hard to extend to other regexes.

There is also some tokenization work done in #5613 which should be considered.

I think the best thing to do is:

  • Refactor unicode.h (+ potentially unicode.cpp) in such a way that it supports std::regex as proposed here, but can fall back to custom optimized implementations when available (see the sketch after this comment)
  • Add tools and/or instructions in the repo for re-creating the regexes using the https://github.com/Genivia/RE-flex tool (or something else if appropriate)
  • Make it clear in the model loading logs which regex is being used

Since this might take more time to implement, we can revert bpe_gpt2_preprocess() to the original implementation from before this PR in order to keep things as they are (even if not 100% correct in certain cases). And after we merge this PR, we can start working on the above.

Anyone interested in helping out with this?
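A minimal sketch of the fallback dispatch described in the list above, assuming a lookup from the pre-tokenization regex string to an optional hand-optimized splitter; the function and type names here are illustrative, not taken from unicode.h:

```cpp
#include <functional>
#include <map>
#include <regex>
#include <string>
#include <vector>

// Maps a text fragment to the list of pre-tokenized pieces.
using splitter_fn = std::function<std::vector<std::string>(const std::string &)>;

// Generic (slower) path: split `text` by collecting every match of `pattern`.
static std::vector<std::string> regex_split_generic(const std::string & text, const std::string & pattern) {
    std::vector<std::string> pieces;
    const std::regex re(pattern);
    for (auto it = std::sregex_iterator(text.begin(), text.end(), re); it != std::sregex_iterator(); ++it) {
        pieces.push_back(it->str());
    }
    return pieces;
}

// Dispatch: prefer a hand-optimized splitter registered for this exact pattern
// (e.g. the existing GPT-2 implementation), otherwise fall back to std::regex.
// Logging which path was taken would address the third bullet point above.
static std::vector<std::string> unicode_regex_split(const std::string & text, const std::string & pattern,
                                                    const std::map<std::string, splitter_fn> & optimized) {
    const auto it = optimized.find(pattern);
    if (it != optimized.end()) {
        return it->second(text);               // fast, custom implementation
    }
    return regex_split_generic(text, pattern); // generic std::regex fallback
}
```

Note that std::regex (ECMAScript grammar) cannot evaluate Unicode property classes such as \p{L}, so patterns that use them would still have to be expanded or generated (e.g. with the RE-flex tool mentioned above) before they can go through the fallback path.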

ggerganov added the "help wanted" (Extra attention is needed) and "good first issue" (Good for newcomers) labels on Mar 2, 2024
@DOGEwbx commented Mar 4, 2024

I'm really happy to see everyone working hard on implementing the pretokenize mechanism. I apologize for not addressing related issues sooner; I've been busy with other matters recently. One issue I'd like to mention is that in my original implementation (#4070), I used wcregex to speed up regex matching. However, the dependency on wchar, which has different default data types across compilers on Unix vs. Mac/Windows, remains unresolved, so I think it only works correctly on Unix right now. This is mainly because I lack experience with cross-platform C++ compilation. I'm hoping someone can help out with this.
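A minimal sketch of the portability problem described above, assuming the text is converted to wide characters before matching; this is only an illustration of the platform difference, not code from the PR:

```cpp
#include <cstdio>
#include <string>

int main() {
    // wchar_t is typically 4 bytes on Unix-like systems (UTF-32 code points)
    // but 2 bytes on Windows/MSVC (UTF-16 code units), so std::wregex-based
    // matching can behave differently across platforms (e.g. split surrogate pairs).
    std::printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t));

    // One portable alternative: decode UTF-8 into fixed-width 32-bit code points
    // (char32_t) and run the splitting logic on those, instead of relying on the
    // platform-defined wchar_t.
    const std::u32string codepoints = U"abc \u00e9 \U0001F600"; // always 32-bit units
    std::printf("code points: %zu\n", codepoints.size());
    return 0;
}
```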

@jaggzh (Contributor, Author) commented Mar 4, 2024 via email

@ggerganov (Owner)

Adding an option will become too messy.

The old tokenizer is not necessarily wrong; it's just that it implements one of the many different BPE pre-tokenization regexes:

https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py

At least this is how I understand it.

Regarding the compatibility with Mac/Windows - I don't know enough, but if it can be fixed by going to the slower std::regex version (the one that was proposed before wcregex), then we can do that. But again, we want to do that just for the new Deepseek models.
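For reference, this is the well-known GPT-2 pre-tokenization pattern from OpenAI's original encoder (one of the regexes listed in the tiktoken file linked above); the constant name is illustrative only. std::regex cannot interpret the \p{L} / \p{N} property classes, which is part of why a generated or wide-character regex is needed for the generic path:

```cpp
#include <string>

// GPT-2 BPE pre-tokenization: splits contractions, runs of letters, runs of
// digits, runs of other symbols, and whitespace before byte-pair merging.
// std::regex cannot evaluate \p{L} / \p{N}, so this string cannot be used with
// std::regex directly; it needs an expanded or generated equivalent.
static const std::string GPT2_PRETOKENIZE_REGEX =
    R"('s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+)";
```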

@jaggzh (Contributor, Author) commented Mar 5, 2024

@DOGEwbx
So, all I did was merge your older work by hand (due to all the conflicts from it being so far behind master). Are you up to the task of modifying it (either this PR, or redoing the merge in yours -- it wasn't that complex) to add automatic use of the newer tokenization/regex for Deepseek models?

Note: I have no say in this project, so don't take this as me saying it's the way to get the patch finalized and approved. :}}

@ggerganov (Owner)

Superseded by #6920

ggerganov closed this on May 7, 2024