Various changes from blaze2004's PR.
* Add TypeScript types definition file
* Refactor tokenizer into a Class
* Allow passing custom vocab and merge data to tokenizer
* Allow passing custom tests to tokenizer test runner

Co-authored-by: Shubham Tiwari <[email protected]>
belladoreai and blaze2004 committed Mar 24, 2024
1 parent b88929e commit 93ed89d
Showing 5 changed files with 393 additions and 352 deletions.
13 changes: 11 additions & 2 deletions README.md
@@ -17,7 +17,7 @@ Developed by [belladore.ai](https://belladore.ai)

## Import

-Option 1: Install as an npm package and import as ES6 module
+Recommended way: Install as an npm package and import as ES6 module

```
npm install llama-tokenizer-js
```

@@ -29,12 +29,16 @@

```
import llamaTokenizer from 'llama-tokenizer-js'
console.log(llamaTokenizer.encode("Hello world!").length)
```

-Option 2: Load as ES6 module with `<script>` tags in your HTML
+Alternative: Load as ES6 module with `<script>` tags in your HTML

```
<script type="module" src="https://belladoreai.github.io/llama-tokenizer-js/llama-tokenizer.js"></script>
```

Alternative: for TypeScript projects, imports [should](https://github.com/belladoreai/llama-tokenizer-js/issues/12#issuecomment-1790073415) now work with the `types.d.ts` file, but please file an issue if something needs to change.

Alternative: for CommonJS projects, the tokenizer [should](https://github.com/belladoreai/llama-tokenizer-js/issues/10) work with `const llamaTokenizer = await import('llama-tokenizer-js');`

## Usage

Once you have the module imported, you can encode or decode with it. Training is not supported.
@@ -101,4 +105,9 @@ When you see a new LLaMA model released, this tokenizer is most likely compatible

If you want to modify this library to support a new LLaMA tokenizer (new as in trained from scratch, not using the same tokenizer as most LLaMA models do), you should be able to do so by swapping the vocabulary and merge data (the 2 long variables near the end of `llama-tokenizer.js` file). This repo has [a Python script](data-conversion.py) for your convenience.

You can pass custom vocab and merge data to the tokenizer by instantiating it like this:

```
import { LlamaTokenizer } from 'llama-tokenizer-js'
const tokenizer = new LlamaTokenizer(custom_vocab, custom_merge_data);
```
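To make the role of the two data structures concrete, here is a minimal, self-contained sketch of BPE-style encoding driven by a vocab and a merge list. The tiny tables below are made up for illustration only; they are not the real LLaMA data, and the actual library's internals may differ:

```
// Hypothetical toy vocab: maps token strings to ids.
const vocab = new Map([
  ["h", 0], ["e", 1], ["l", 2], ["o", 3],
  ["he", 4], ["ll", 5], ["hell", 6], ["hello", 7],
]);

// Hypothetical toy merge data: token pairs in priority order.
const merges = [["h", "e"], ["l", "l"], ["he", "ll"], ["hell", "o"]];

function bpeEncode(text, vocab, merges) {
  // Start from individual characters, then apply each merge rule
  // in priority order, one left-to-right pass per rule.
  let tokens = [...text];
  for (const [a, b] of merges) {
    const merged = [];
    let i = 0;
    while (i < tokens.length) {
      if (i + 1 < tokens.length && tokens[i] === a && tokens[i + 1] === b) {
        merged.push(a + b);
        i += 2;
      } else {
        merged.push(tokens[i]);
        i += 1;
      }
    }
    tokens = merged;
  }
  // Look up the surviving token strings in the vocab.
  return tokens.map((t) => vocab.get(t));
}

console.log(bpeEncode("hello", vocab, merges)); // [7]
```

Swapping in a different vocab and merge list changes which substrings collapse into single tokens, which is exactly what passing custom data to the tokenizer constructor accomplishes.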
