
convert : add t5 tokenizer tests #1

Merged

Conversation

ggerganov

Add tokenizer tests + suggest init with UNK tokens

```sh
# get tokenizers
python3 convert-hf-to-gguf-update.py <hf_token>

# generate ggml vocab and tests
python3 convert-hf-to-gguf.py models/tokenizers/t5/ --outfile models/ggml-vocab-t5.gguf --vocab-only

# run the tests
make -j tests
./tests/test-tokenizer-0 models/ggml-vocab-t5.gguf
```

Currently, a few tests are failing:

```
src: '!!!!!!'
res: ' !!!!!!'
tok: 3 17065 55
main : failed test:    '!!!!!!'
main : detokenized to: ' !!!!!!' instead of ' !!!!!!'
main : expected tokens:      3 ' ',     55 '!',  17065 '!!!!!',
main : got tokens:           3 ' ',  17065 '!!!!!',     55 '!',

src: ' '
res: '▅'
tok: 2
main : failed test:    ' '
main : detokenized to: '▅' instead of ''
main : expected tokens:
main : got tokens:           2 '▅',
```

@fairydreaming
Owner

@ggerganov These tokenization test failures are caused by differences in tokenization between the transformers T5 "slow" tokenizer (`T5Tokenizer`) and the "fast" tokenizer (`T5TokenizerFast`). My Unigram tokenizer implementation is compatible with the "slow" tokenizer. It looks like the default implementation returned by `AutoTokenizer.from_pretrained(...)` in convert-hf-to-gguf-update.py is `T5TokenizerFast`. I regenerated the test inputs with `T5Tokenizer` by passing `use_fast=False` to `AutoTokenizer.from_pretrained(...)`, and all tests passed. Try something like this:

```diff
@@ -141,7 +147,10 @@ for model in models:
 
     # create the tokenizer
     try:
-        tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
+        if name == "t5":
+            tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}", use_fast=False)
+        else:
+            tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
     except OSError as e:
         logger.error(f"Error loading tokenizer for model {name}. The model may not exist or is not accessible with the provided token. Error: {e}")
         continue  # Skip to the next model if the tokenizer can't be loaded
@@ -299,7 +309,10 @@ for model in models:
 
     # create the tokenizer
     try:
-        tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
+        if name == "t5":
+            tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}", use_fast=False)
+        else:
+            tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
     except OSError as e:
         logger.error(f"Failed to load tokenizer for model {name}. Error: {e}")
         continue  # Skip this model and continue with the next one in the loop
```
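
The slow/fast difference can also be cross-checked directly. Below is a minimal sketch (not part of the converter script); it assumes transformers and sentencepiece are installed and that the tokenizer files were already downloaded into models/tokenizers/t5 by convert-hf-to-gguf-update.py:

```python
# Minimal sketch: compare "slow" (SentencePiece-based) and "fast" T5 tokenization.
# Assumes transformers + sentencepiece are installed and the tokenizer files
# already exist under models/tokenizers/t5.
from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained("models/tokenizers/t5", use_fast=False)
fast = AutoTokenizer.from_pretrained("models/tokenizers/t5", use_fast=True)

for text in ["!!!!!!", " "]:
    # Per the failing tests above, the two implementations may split these
    # inputs into differently ordered pieces.
    print(repr(text), "slow:", slow.encode(text), "fast:", fast.encode(text))
```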

@ggerganov
Author

Thank you - fixed

@fairydreaming fairydreaming merged commit 7c610fa into fairydreaming:t5-clean-3 Jul 2, 2024
1 check passed
fairydreaming pushed a commit that referenced this pull request Aug 4, 2024
* [example] batched-bench "segmentation fault"

When `llama-batched-bench` is invoked _without_ setting `-npl`, "number
of parallel prompts", it segfaults.

The segfault is caused by invoking `max_element()` on a zero-length
vector, `n_pl`.

This commit addresses that by first checking whether the number of
parallel prompts is zero; if so, the maximum sequence size is set to 1,
otherwise it is set to the result of `max_element()` as before.

This fixes the following crash, observed when running `lldb build/bin/llama-batched-bench -- -m models/Meta-Llama-3-8B.gguf`:

```
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010000366c llama-batched-bench`main(argc=3, argv=0x000000016fdff268) at batched-bench.cpp:72:28
   69  	    llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
   70
   71  	    // ensure enough sequences are available
-> 72  	    ctx_params.n_seq_max = *std::max_element(n_pl.begin(), n_pl.end());
```
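
For illustration, a minimal sketch of the guard described above (not the exact committed code; `n_pl` here is a stand-in for the vector of `-npl` values):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Stand-in for the values parsed from -npl; empty when the flag is omitted.
    std::vector<int32_t> n_pl;

    // Dereferencing max_element() on an empty range is undefined behaviour,
    // so fall back to a single sequence when no parallel prompt counts were given.
    const int32_t n_seq_max = n_pl.empty()
        ? 1
        : *std::max_element(n_pl.begin(), n_pl.end());

    std::printf("n_seq_max = %d\n", n_seq_max);
    return 0;
}
```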

* Update examples/batched-bench/batched-bench.cpp

Co-authored-by: compilade <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: compilade <[email protected]>