The current tokenizers use `rust-lapper` for overlap computation. It works well, but we should be using our own algorithm -- namely AIList. Moreover, tokenization could get even faster if we made some assumptions about our data, such as whether it is sorted.
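As a rough sketch of what an AIList-backed lookup could look like, here is a simplified single-component variant: intervals sorted by start, augmented with a running maximum of ends so a backward scan can stop early. The full AIList additionally decomposes long "covering" intervals into sub-lists to keep that scan short, and the names here are illustrative, not the tokenizers' actual API.

```rust
/// Simplified single-component AIList sketch (hypothetical names).
/// Intervals are half-open [start, end).
struct AiList {
    starts: Vec<u32>,
    ends: Vec<u32>,
    max_ends: Vec<u32>, // running maximum of `ends`: the "augmentation"
}

impl AiList {
    fn new(mut intervals: Vec<(u32, u32)>) -> Self {
        intervals.sort_unstable_by_key(|iv| iv.0); // sort by start
        let starts: Vec<u32> = intervals.iter().map(|iv| iv.0).collect();
        let ends: Vec<u32> = intervals.iter().map(|iv| iv.1).collect();
        let mut max_ends = ends.clone();
        for i in 1..max_ends.len() {
            max_ends[i] = max_ends[i].max(max_ends[i - 1]);
        }
        AiList { starts, ends, max_ends }
    }

    /// Indices of all intervals overlapping [qs, qe).
    fn query(&self, qs: u32, qe: u32) -> Vec<usize> {
        let mut hits = Vec::new();
        // First index whose start >= qe; everything left of it may overlap.
        let mut i = self.starts.partition_point(|&s| s < qe);
        while i > 0 {
            i -= 1;
            // No interval at or before `i` reaches past qs: stop early.
            if self.max_ends[i] <= qs {
                break;
            }
            if self.ends[i] > qs {
                hits.push(i);
            }
        }
        hits
    }
}
```

Usage would look like `AiList::new(regions).query(100, 200)`, returning the indices of regions overlapping `[100, 200)`.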
Things we should decide on:

- Can we replace `rust-lapper` with AIList?
- Can we create a version of the tokenizers that assumes sorted files? Call it a `SpeedTokenizer`. (A minimal sketch of the sorted-sweep idea follows this list.)
- Should the above `SpeedTokenizer` be checking for sorted-ness?
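On the `SpeedTokenizer` question, the payoff of assuming sorted input is that overlaps for a sorted batch of queries against a sorted universe can be computed in a single sweep, near-linear for typical lightly-overlapping region sets, with a cheap O(n) pre-check if we decide the tokenizer should verify sorted-ness rather than trust it. A minimal sketch, assuming half-open `(start, end)` intervals and illustrative function names (not an existing API):

```rust
/// O(n) pre-check for the sorted-ness question: one pass over starts.
fn is_sorted_by_start(ivs: &[(u32, u32)]) -> bool {
    ivs.windows(2).all(|w| w[0].0 <= w[1].0)
}

/// Sweep-based overlap computation: both `universe` and `queries` must be
/// sorted by start. Returns (query index, universe index) pairs.
fn sweep_overlaps(universe: &[(u32, u32)], queries: &[(u32, u32)]) -> Vec<(usize, usize)> {
    debug_assert!(is_sorted_by_start(universe) && is_sorted_by_start(queries));
    let mut pairs = Vec::new();
    let mut lo = 0; // universe intervals before `lo` are retired for good
    for (qi, &(qs, qe)) in queries.iter().enumerate() {
        // Anything ending at or before qs cannot overlap this query,
        // nor any later one (query starts only increase): retire it.
        while lo < universe.len() && universe[lo].1 <= qs {
            lo += 1;
        }
        // Report candidates until their starts pass the query's end.
        let mut i = lo;
        while i < universe.len() && universe[i].0 < qe {
            if universe[i].1 > qs {
                pairs.push((qi, i));
            }
            i += 1;
        }
    }
    pairs
}
```

Since the check is a single cheap pass, one option would be to run it by default and fall back to the general (AIList) path when the input turns out to be unsorted.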