Add GPT2 tokenizer setup and refactor code #56
Conversation
Force-pushed from 3d0d0c6 to 050e136.
Thanks for the cleanup too! Just a couple of small improvements.
```ts
  embedBatch,
  expectedOutputSeqPrepFn,
} from '../tokens/token_gemb';

function tokenize_fn_test(input: string): number[] {
```
I'm not sure I get what this function is for. Can you add a comment to say a little about what it's intended to do?
It's a function for testing. Added a comment and simplified it.
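For context, a minimal sketch of what such a test-only tokenizer could look like; the character-code mapping here is an illustrative assumption, not the actual code from this PR:

```ts
// Hypothetical sketch of a test-only tokenizer (not the PR's actual
// implementation): map each character to its char code, so tests get
// deterministic token ids without loading a real tokenizer.
function tokenize_fn_test(input: string): number[] {
  return Array.from(input).map((c) => c.charCodeAt(0));
}

// tokenize_fn_test('abc') → [97, 98, 99]
```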
…ing the local task token rep. There seems to be a memory leak, and a refactoring is necessary.
…ngle file. Refactor token gemb file so that embedBatch is adapted to a use case where a tokenizer is available.
…le with next N tokens prediction.
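As a rough illustration of the `embedBatch` refactor described in the commits above, here is a sketch of an embedding entry point that accepts a pluggable tokenizer; every name and type below is an assumption, not the real signature in `token_gemb.ts`:

```ts
// Hypothetical sketch (names and types are assumptions, not the real
// token_gemb.ts API): embedBatch takes a tokenize function as a
// parameter, so a GPT2 tokenizer or a test stub can drive the same
// embedding path.
type TokenizeFn = (input: string) => number[];

function embedBatchSketch(tokenize: TokenizeFn, inputs: string[]): number[][] {
  // Tokenize each example; a real implementation would then look up
  // embeddings for the ids and pad to a common sequence length.
  return inputs.map((s) => tokenize(s));
}
```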
Force-pushed from 999531a to b2fbcc9.
LGTM! Thanks!
Main changes introduced are:
- Set up the GPT2 tokenizer using the `gpt-tokenizer` library (see the sketch after this list).
- Refactored the `token_gemb.ts` file to make it simpler to extend it to other tokenizations.
- Moved code shared between `gpt2.ts` and `transformer_gtensor.ts` to a common file.

Fixes Issues #54 & #52.
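A minimal sketch of the kind of `gpt-tokenizer` usage this change sets up, assuming the library's top-level encode/decode entry points; the exact import path and wiring in this repo may differ:

```ts
// Sketch assuming gpt-tokenizer's encode/decode API; the import path
// for a GPT-2-specific encoding may differ from the default shown here.
import { encode, decode } from 'gpt-tokenizer';

const ids: number[] = encode('hello world'); // BPE token ids
console.log(decode(ids)); // → 'hello world'
```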
Refactoring the token embedding type and wrapping it with a class will be done in a follow-up CL.
There is also some work left on refactoring the tokenization of tasks so that we can reuse the tokens for next N tokens prediction.