Add GPT2 tokenizer setup and refactor code #56

aliciafmachado · 2025-01-27T08:55:16Z

Main changes introduced:

Add support for using a pre-trained tokenizer from the gpt-tokenizer library.
Refactor the token_gemb.ts file to make it simpler to extend to other tokenizations.
Move functions shared by gpt2.ts and transformer_gtensor.ts into a common file.
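
To illustrate the idea behind making the token code easier to extend to other tokenizations, here is a minimal sketch that treats every tokenizer as a plain function from text to token ids. All names in it (TokenizeFn, makeVocabTokenizer, tokenizeBatch) are hypothetical and are not taken from token_gemb.ts, and right-padding with a pad index is an assumption rather than what embedBatch is known to do.

```typescript
// Hypothetical names throughout: TokenizeFn, makeVocabTokenizer and
// tokenizeBatch are illustrative and are not the PR's actual API.

// Any tokenizer reduces to a function from text to integer token ids.
type TokenizeFn = (input: string) => number[];

// A vocabulary-backed tokenizer: whitespace-separated tokens are looked up
// in a fixed list, and anything missing maps to the unknown token's index.
function makeVocabTokenizer(vocab: string[], unknownToken: string): TokenizeFn {
  const tokenToIdx = new Map<string, number>();
  vocab.forEach((token, idx) => tokenToIdx.set(token, idx));
  const unknownIdx = tokenToIdx.get(unknownToken);
  if (unknownIdx === undefined) {
    throw new Error(`unknown token "${unknownToken}" is not in the vocabulary`);
  }
  const fallbackIdx: number = unknownIdx;
  return (input: string): number[] =>
    input
      .split(/\s+/)
      .filter((token) => token.length > 0)
      .map((token) => tokenToIdx.get(token) ?? fallbackIdx);
}

// Tokenize a batch and right-pad every sequence to the longest one
// (assumption: right-padding), so the result is rectangular and the
// caller can turn it into a tensor.
function tokenizeBatch(
  inputs: string[],
  tokenize: TokenizeFn,
  padIdx: number
): number[][] {
  const sequences = inputs.map(tokenize);
  const maxLen = sequences.reduce((m, s) => Math.max(m, s.length), 0);
  return sequences.map((s) => {
    const padded = s.slice();
    while (padded.length < maxLen) {
      padded.push(padIdx);
    }
    return padded;
  });
}

const vocabTokenizer = makeVocabTokenizer(
  ['[PAD]', '[UNK]', 'a', 'b', 'c'],
  '[UNK]'
);
const batch = tokenizeBatch(['a b c', 'c z'], vocabTokenizer, 0);
```

Because tokenizeBatch only depends on the TokenizeFn type, a pre-trained tokenizer whose encode function maps a string to an array of token ids could presumably be passed in place of vocabTokenizer without changing the batching code.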

Fixes Issues #54 & #52.

Refactoring the token embedding type and wrapping it in a class will be done in a follow-up CL.
Some work also remains on refactoring the tokenization of tasks so that the tokens can be reused for next-N-token prediction.
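
As a rough illustration of what reusing tokens for next-N-token prediction can mean, the sketch below slides a window over an already tokenized sequence and pairs each context with the N tokens that follow it. The helper name nextNTokenExamples and its parameters are hypothetical; this is one common way to build such examples, not necessarily how the follow-up work will do it.

```typescript
// Hypothetical helper, not part of this PR: build (input, target) pairs
// from one token sequence, where each target holds the next n tokens.
interface NextNExample {
  input: number[];
  target: number[];
}

function nextNTokenExamples(
  tokens: number[],
  contextLen: number,
  n: number
): NextNExample[] {
  if (contextLen <= 0 || n <= 0) {
    throw new Error('contextLen and n must be positive');
  }
  const examples: NextNExample[] = [];
  // Stop once there are no longer n tokens left after the context window.
  for (let start = 0; start + contextLen + n <= tokens.length; start++) {
    examples.push({
      input: tokens.slice(start, start + contextLen),
      target: tokens.slice(start + contextLen, start + contextLen + n),
    });
  }
  return examples;
}

// With 5 tokens, a context of 2 and n = 2, two windows fit:
// input [10, 11] with target [12, 13], and input [11, 12] with target [13, 14].
const nextNExamples = nextNTokenExamples([10, 11, 12, 13, 14], 2, 2);
```

The same token ids serve both as inputs and as targets, which is why tokenizing once and slicing afterwards avoids re-tokenizing the task text for every N.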

iislucas

Thanks for the cleanup too! Just a couple of minor improvements.

iislucas · 2025-01-28T08:57:11Z

animated-transformer/src/lib/tokens/token_gemb.spec.ts

  embedBatch,
  expectedOutputSeqPrepFn,
 } from '../tokens/token_gemb';

+function tokenize_fn_test(input: string): number[] {


I'm not sure I get what this function is for. Can you add a comment saying a little about what it's intended to do?

aliciafmachado · 2025-01-31

It's a function for testing. I added a comment and simplified it.

animated-transformer/src/lib/tokens/token_gemb.ts


iislucas

LGTM! Thanks!

aliciafmachado requested a review from iislucas January 27, 2025 08:55

aliciafmachado force-pushed the gpt-tokenizer branch from 3d0d0c6 to 050e136 Compare January 27, 2025 08:58

iislucas reviewed Jan 28, 2025


aliciafmachado added 8 commits February 1, 2025 16:41

Add gpt-tokenizer dependency.

fad792d

Add function for using a standalone tokenizer with GPT2 instead of using the local task token rep. There seems to be a memory leak, and a refactoring is necessary.

76d428b

Refactor code so that common functions among transformers are in a single file. Refactor token gemb file so that embedBatch is adapted to a use case where a tokenizer is available.

f7bced9

Refactored computePrediction and computeDecoder so that it's compatible with next N tokens prediction.

37383e5

Remove unused test function and clean-up comments.

7e5e933

Clarify some TODOs and comments on the code.

d15f675

Add node_modules/gpt-tokenizer to package-lock.json.

352524a

Clarify testing function and simplify mapToIdx and tokenizeAndMapToIdx.

b2fbcc9

aliciafmachado force-pushed the gpt-tokenizer branch from 999531a to b2fbcc9 Compare February 1, 2025 15:42

iislucas approved these changes Feb 5, 2025


aliciafmachado merged commit 97f7bbd into PAIR-code:main Feb 5, 2025
1 check passed

aliciafmachado deleted the gpt-tokenizer branch February 5, 2025 12:24
