Hierarchical tokenizers #27
Merged
This PR introduces a few new things:

1. A new "tokenizer-config" spec

It is now preferred to instantiate tokenizers with a `tokenizers.toml` file. This was added to support hierarchical tokenizers, which might take many BED files to instantiate. Documentation is still needed for the `tokenizers.toml` config files.

2. A new hierarchical tokenizer
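The new config spec exists mainly to feed this tokenizer its universes. Since the schema isn't documented yet, every field name below is an assumption, not the real format; this is only a sketch of what a `tokenizers.toml` for a two-level hierarchy might look like:

```toml
# Hypothetical tokenizers.toml -- field names are assumptions,
# not the documented schema.
[tokenizer]
type = "hierarchical"

# Universes listed in priority order; earlier entries are tried first.
[[universes]]
path = "universes/fine_grained.bed"

[[universes]]
path = "universes/coarse.bed"
```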
This directly addresses discussion in #25, https://github.com/databio/geniml_dev/issues/85, and https://github.com/databio/geniml_dev/issues/79.

A hierarchical tokenizer can take many universes as input, establishing a tokenization priority among them. The primary goal is to significantly reduce the number of UNK-token hits when tokenizing datasets.
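To make the fallback behavior concrete, here is a rough Python illustration of priority-ordered tokenization. This is not the crate's actual API: real tokenization matches regions by interval overlap against BED universes, which is simplified here to exact lookup.

```python
# Sketch of priority-ordered ("hierarchical") tokenization.
# NOTE: NOT the real geniml API; real tokenizers match regions by
# interval overlap, simplified here to exact-coordinate lookup.

UNK = "<unk>"

class HierarchicalTokenizer:
    def __init__(self, universes):
        # Universes are tried in order: earlier = higher priority.
        self.universes = universes

    def tokenize(self, region):
        for universe in self.universes:
            token = universe.get(region)
            if token is not None:
                return token
        # Only falls through when no universe covers the region,
        # which is exactly the UNK-hit count the hierarchy reduces.
        return UNK

# A fine-grained primary universe backed by a coarser secondary one.
primary = {("chr1", 100, 200): "A"}
secondary = {("chr1", 100, 200): "B", ("chr2", 50, 80): "C"}

tok = HierarchicalTokenizer([primary, secondary])
print(tok.tokenize(("chr1", 100, 200)))  # "A": primary wins
print(tok.tokenize(("chr2", 50, 80)))    # "C": falls back to secondary
print(tok.tokenize(("chrX", 0, 10)))     # "<unk>": covered by neither
```

With only the primary universe, two of the three regions above would be UNK; adding the coarser fallback cuts that to one.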
3. A new MetaTokenizer
Another extension of the tokenizers, the `MetaTokenizer` implements the "meta-tokenization" idea we had. In brief, clusters of highly similar regions (regions that sit very close together in embedding space) are all mapped to a single "meta token", with the primary goal of drastically reducing vocabulary sizes for our models, improving training and inference speed and lowering RAM requirements.

And of course, Python bindings for all of it are implemented. I've also removed a lot of antiquated, unused code. While doing so, I spent considerable time revamping the documentation and tests.
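The vocabulary-shrinking effect of meta-tokenization can be sketched in a few lines of Python. The clustering step (grouping regions by embedding proximity) is assumed to have already happened; the function and the `<meta_i>` naming below are hypothetical, not the `MetaTokenizer` API.

```python
# Sketch of the "meta-tokenization" idea (hypothetical, not the real API):
# regions whose embeddings cluster tightly are collapsed to one meta
# token, shrinking the vocabulary the model has to learn.

def build_meta_vocab(clusters):
    """Map every region token in a cluster to a shared meta token."""
    mapping = {}
    for i, cluster in enumerate(clusters):
        meta = f"<meta_{i}>"
        for token in cluster:
            mapping[token] = meta
    return mapping

# Three near-duplicate regions collapse into one meta token, so the
# effective vocabulary shrinks from 4 region tokens to 2 meta tokens.
clusters = [["chr1:100-200", "chr1:105-195", "chr1:98-210"],
            ["chr2:50-80"]]
meta = build_meta_vocab(clusters)
print(meta["chr1:105-195"])     # "<meta_0>"
print(len(set(meta.values())))  # 2
```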