# picoGPT in J

J port of [jaymody/picoGPT](https://github.com/jaymody/picoGPT), "An unnecessarily tiny implementation of GPT-2 in NumPy."

## Install

Download J (tested with J903 and base library 9.03.08; see the wiki for installation instructions). You will also need the `convert/pjson` addon:

```j
NB. If nothing is printed, the addon is already installed
install 'convert/pjson'
```

## Download models

Get the GPT-2 models you want by running the `download.sh` script (e.g. `./download.sh 124M`). If you can't run it, first make a `models/` directory and download the tokenizer files into it.
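For example, with `curl` (a sketch; it assumes the tokenizer files are still hosted under HuggingFace's `gpt2` repository, with the filenames listed in the Notes below):

```sh
# Assumption: the tokenizer files live in HuggingFace's gpt2 repo.
# They are shared by all model sizes, so they go in models/ itself.
mkdir -p models
curl -L -o models/vocab.json https://huggingface.co/gpt2/resolve/main/vocab.json
curl -L -o models/merges.txt https://huggingface.co/gpt2/resolve/main/merges.txt
```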

Then, download the `model.safetensors` and `config.json` for the model you want and place them in the corresponding `models/[model size]` directory (e.g. `models/124M`).
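A sketch for the 124M model, again assuming the standard HuggingFace paths (the larger sizes would come from the `gpt2-medium`, `gpt2-large`, and `gpt2-xl` repositories instead):

```sh
# Assumption: the 124M checkpoint is HuggingFace's gpt2 repo;
# swap in gpt2-medium / gpt2-large / gpt2-xl for the larger sizes.
mkdir -p models/124M
curl -L -o models/124M/config.json https://huggingface.co/gpt2/resolve/main/config.json
curl -L -o models/124M/model.safetensors https://huggingface.co/gpt2/resolve/main/model.safetensors
```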

## Usage

Run `gpt2.ijs` (e.g. `jconsole gpt2.ijs`). Then:

```j
NB. Load model
model '124M'
NB. Generate 40 tokens by default
gen 'Alan Turing theorized that computers would one day become'

NB. Switch model
model '1558M'
NB. Generate 79 tokens. Assign the output to a variable to prevent it from
NB. being printed to the console twice.
out =. 79 gen 'The importance of nomenclature, notation, and language as tools of'
```

## Notes

- When the input length exceeds `n_ctx`, rather than throwing an exception, only the last `n_ctx` tokens are used.
- Instead of a progress bar, tokens are printed as they're generated.
- All calculations are done with 64-bit floats since J doesn't have 32-bit floats (not sure about 32-bit J, though).
- The Safetensors format is used since it's easier to parse. This means checkpoints are downloaded from HuggingFace rather than OpenAI's Azure storage. Filenames are also different:
  - `model.ckpt.*` -> `model.safetensors`
  - `hparams.json` -> `config.json`
  - `encoder.json` -> `vocab.json`
  - `vocab.bpe` -> `merges.txt`
- Thanks to [karpathy/minGPT](https://github.com/karpathy/minGPT) for having a good explanation of the BPE tokenizer.