Why MLC_ENABLE_SENTENCEPIECE_TOKENIZER OFF by default? #45

Open
korciuch opened this issue Oct 13, 2024 · 5 comments

Comments

@korciuch

korciuch commented Oct 13, 2024

Should MLC_ENABLE_SENTENCEPIECE_TOKENIZER be ON by default in CMakeLists.txt? I had to turn it on before ./build_and_run.sh could build and run the example target. Otherwise, I get an assertion failure in src/sentencepiece_tokenizer.cc:

#else
std::unique_ptr<Tokenizer> Tokenizer::FromBlobSentencePiece(const std::string& model_blob) {
  assert(false);
  throw;
}
#endif  // MLC_ENABLE_SENTENCEPIECE_TOKENIZER

@zhaoxuejun1234

Hi, have you figured that out? I have the same issue now.

@tqchen
Contributor

tqchen commented Nov 12, 2024

cc @MasterJH5574: shall we turn it on by default? (We can always turn it off by default downstream.)

@MasterJH5574
Member

Sorry for the delayed response. Yes, we can enable it. I will follow up within the next two days.

@MasterJH5574
Member

We enabled SentencePiece in PR #47 and have bumped the dependency in mlc-llm accordingly (mlc-ai/mlc-llm#3025). Please check out the latest code, thanks!

@navy985

navy985 commented Nov 25, 2024

Hi, I turned MLC_ENABLE_SENTENCEPIECE_TOKENIZER ON, but the tokenized result is different from the one transformers produces. How can I resolve this?

==============================tokenizers-cpp==================
Ran the SentencePieceTokenizerExample() example with OpenGVLab/InternVL2-2B/tokenizer.model and the special prompt shown below.
The tokenized result:
[333, 352, 449, 6368, 352, 527, 333, 352, 449, 5064, 352, 330, 1008, 364, 3993, 505, 410, 387, 11498, 2327, 446, 7016, 345, 333, 352, 449, 6368, 352, 527, 333, 352, 449, 5064, 352, 330, 525, 11353, 364]

void TestTokenizer(std::unique_ptr<Tokenizer> tok, bool print_vocab = false,
                   bool check_id_back = true) {
  // Check #1. Encode and Decode
  std::string message = "What is the capital of Canada?";
  std::string prompt = std::string("<|im_start|>") + "user\n" + message + "<|im_end|>\n" +
                       "<|im_start|>" + "assistant\n";
  std::vector<int> ids = tok->Encode(prompt);
  std::string decoded_prompt = tok->Decode(ids);
  PrintEncodeResult(ids);
  std::cout << "decode=\"" << decoded_prompt << "\"" << std::endl;
  assert(decoded_prompt == prompt);

  //......
}

==============================python transformers======================
from transformers import AutoTokenizer
path = 'OpenGVLab/InternVL2-2B'  # which includes tokenizer.model
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
message = "What is the capital of Canada?"
prompt = "<|im_start|>" + "user\n" + message + "<|im_end|>\n" +"<|im_start|>" + "assistant\n"
out = tokenizer.encode(prompt)
print(out)
#[1, 92543, 1008, 364, 3993, 505, 410, 387, 11498, 2327, 446, 7016, 345, 92542, 364, 92543, 525, 11353, 364]
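
One way to narrow down where the two outputs diverge is to align the two ID lists posted above. The following is a minimal sketch using Python's standard difflib; the variable names (cpp_ids, hf_ids, shared, hf_only) are illustrative, and the ID lists are copied verbatim from this comment. It only compares the numbers and does not load either tokenizer.

```python
from difflib import SequenceMatcher

# IDs reported above for tokenizers-cpp (SentencePiece path).
cpp_ids = [333, 352, 449, 6368, 352, 527, 333, 352, 449, 5064, 352, 330,
           1008, 364, 3993, 505, 410, 387, 11498, 2327, 446, 7016, 345,
           333, 352, 449, 6368, 352, 527, 333, 352, 449, 5064, 352, 330,
           525, 11353, 364]

# IDs reported above for transformers (use_fast=False).
hf_ids = [1, 92543, 1008, 364, 3993, 505, 410, 387, 11498, 2327, 446,
          7016, 345, 92542, 364, 92543, 525, 11353, 364]

matcher = SequenceMatcher(None, cpp_ids, hf_ids, autojunk=False)

# Each block is (start in cpp_ids, start in hf_ids, length).
# The final zero-length sentinel block is dropped.
shared = [(m.a, m.b, m.size) for m in matcher.get_matching_blocks() if m.size]
for a, b, size in shared:
    print(f"cpp_ids[{a}:{a + size}] == hf_ids[{b}:{b + size}]: {cpp_ids[a:a + size]}")

# IDs that occur only in the transformers output.
hf_only = sorted(set(hf_ids) - set(cpp_ids))
print("only in transformers output:", hf_only)
```

On these two lists, the shared runs are the IDs that sit between the marker positions, so the ordinary text appears to be tokenized identically by both libraries. The differences are confined to where <|im_start|> and <|im_end|> occur in the prompt: transformers emits a single ID there (92543 or 92542) while tokenizers-cpp emits several smaller IDs, and the leading 1 appears only in the transformers output. That pattern would be consistent with transformers handling the markers as added special tokens and prepending a BOS token, but this is an assumption to verify against the model's tokenizer configuration.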
