Various changes from blaze2004's PR.
* Add TypeScript types definition file
* Refactor tokenizer into a Class
* Allow passing custom vocab and merge data to tokenizer
* Allow passing custom tests to tokenizer test runner

Co-authored-by: Shubham Tiwari <[email protected]>
belladoreai and blaze2004 committed Mar 24, 2024
1 parent b88929e commit 93ed89d
Showing 5 changed files with 393 additions and 352 deletions.
13 changes: 11 additions & 2 deletions README.md
@@ -17,7 +17,7 @@ Developed by [belladore.ai](https://belladore.ai)

## Import

-Option 1: Install as an npm package and import as ES6 module
+Recommended way: Install as an npm package and import as ES6 module

```
npm install llama-tokenizer-js
```

@@ -29,12 +29,16 @@

```
import llamaTokenizer from 'llama-tokenizer-js'
console.log(llamaTokenizer.encode("Hello world!").length)
```

-Option 2: Load as ES6 module with `<script>` tags in your HTML
+Alternative: Load as ES6 module with `<script>` tags in your HTML

```
<script type="module" src="https://belladoreai.github.io/llama-tokenizer-js/llama-tokenizer.js"></script>
```

Alternative: for TypeScript projects, imports [should](https://github.com/belladoreai/llama-tokenizer-js/issues/12#issuecomment-1790073415) now work with the `types.d.ts` file, but please file an issue if something needs to change.

Alternative: for CommonJS projects, the tokenizer [should](https://github.com/belladoreai/llama-tokenizer-js/issues/10) work with `const llamaTokenizer = await import('llama-tokenizer-js');`

## Usage

Once you have the module imported, you can encode or decode with it. Training is not supported.
@@ -101,4 +105,9 @@ When you see a new LLaMA model released, this tokenizer is most likely compatible

If you want to modify this library to support a new LLaMA tokenizer (new as in trained from scratch, not using the same tokenizer as most LLaMA models do), you should be able to do so by swapping the vocabulary and merge data (the 2 long variables near the end of `llama-tokenizer.js` file). This repo has [a Python script](data-conversion.py) for your convenience.

You can pass custom vocab and merge data to the tokenizer by instantiating it like this:

```
import { LlamaTokenizer } from 'llama-tokenizer-js'
const tokenizer = new LlamaTokenizer(custom_vocab, custom_merge_data);
```
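To make the role of the two data structures concrete, here is a minimal, self-contained sketch of BPE-style encoding driven by a vocab and a merge list. The tiny tables below are made up for illustration only; they are not the real LLaMA data, and the actual library's internals may differ:

```
// Hypothetical toy vocab: maps token strings to ids.
const vocab = new Map([
  ["h", 0], ["e", 1], ["l", 2], ["o", 3],
  ["he", 4], ["ll", 5], ["hell", 6], ["hello", 7],
]);

// Hypothetical toy merge data: token pairs in priority order.
const merges = [["h", "e"], ["l", "l"], ["he", "ll"], ["hell", "o"]];

function bpeEncode(text, vocab, merges) {
  // Start from individual characters, then apply each merge rule
  // in priority order, one left-to-right pass per rule.
  let tokens = [...text];
  for (const [a, b] of merges) {
    const merged = [];
    let i = 0;
    while (i < tokens.length) {
      if (i + 1 < tokens.length && tokens[i] === a && tokens[i + 1] === b) {
        merged.push(a + b);
        i += 2;
      } else {
        merged.push(tokens[i]);
        i += 1;
      }
    }
    tokens = merged;
  }
  // Look up the surviving token strings in the vocab.
  return tokens.map((t) => vocab.get(t));
}

console.log(bpeEncode("hello", vocab, merges)); // [7]
```

Swapping in a different vocab and merge list changes which substrings collapse into single tokens, which is exactly what passing custom data to the tokenizer constructor accomplishes.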
