(edits in progress)
- Basics
- Applications
- Building & Using
- Transformer architecture
- Example datasets
- GPT architecture
- LLM build plan
- Word embeddings
- Text tokens
- Tokens --> token IDs
- Special context tokens
- Byte pair encoding
- Sliding-window data sampling (see the data-sampling sketch after this outline)
- Token embeddings
- Word position encoding
- The long-sequence problem
- Capturing data dependencies with attention
- Self-attention
- without trainable weights
- weights for all input tokens
- Self-attention with trainable weights
- Computation
- Python class definition
- Causal attention (see the attention sketch after this outline)
- masking
- dropout
- Python class definition
- Multihead attention
- stacking single-head attention layers
- weight splits
- Architecture code
- Layer normalization
- Feed-forward nets with GELU (Gaussian error linear unit) activations
- Shortcut connections
- Attention & linear layers in a transformer block
- Model code
- Generating text
- Evaluating generative text models
- Using GPT to generate text
- Text generation loss
- Training & validation set loss
- LLM training
- Controlling randomness (decoding strategies)
- Temperature scaling
- top-k sampling
- Modifying the text generator (see the decoding sketch after this outline)
- PyTorch: model file load/save (see the save/load sketch after this outline)
- Loading pretrained weights from OpenAI
- Instruction- vs Classification-finetuning
- Dataset prep
- Dataloaders
- Initializing with pretrained weights
- Classification head
- Classification loss & accuracy
- Finetuning - supervised data
- LLM as a spam classifier
- TODO
- What is PyTorch?
- 3 core components
- deep learning, defined
- installation
- Tensors
- scalars, vectors, matrices, tensors
- datatypes
- common tensor ops
- Models as computation graphs
- Automatic differentiation
- Designing multilayer neural nets
- Designing data loaders
- Typical training loops
- Model load/save
- GPUs and training performance
- PyTorch on GPUs
- Single-GPU training
- Multi-GPU training
- Selecting available GPUs
- Resources
- Exercise answers
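
A minimal sketch of the byte pair encoding and sliding-window data sampling steps from the outline above, assuming the tiktoken tokenizer and PyTorch; the class name `SlidingWindowDataset` and the `max_length`/`stride` defaults are illustrative, not the book's exact listing.

```python
import torch
import tiktoken
from torch.utils.data import Dataset, DataLoader


class SlidingWindowDataset(Dataset):
    """Chunks a token stream into overlapping (input, target) pairs,
    where the target is the input shifted one position to the right."""

    def __init__(self, text, tokenizer, max_length=32, stride=16):
        token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - max_length, stride):
            chunk = token_ids[i:i + max_length + 1]
            self.inputs.append(torch.tensor(chunk[:-1]))
            self.targets.append(torch.tensor(chunk[1:]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]


tokenizer = tiktoken.get_encoding("gpt2")     # GPT-2 byte pair encoding
raw_text = "some raw training text <|endoftext|> " * 200   # toy corpus
dataset = SlidingWindowDataset(raw_text, tokenizer)
loader = DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True)
x, y = next(iter(loader))
print(x.shape, y.shape)   # torch.Size([4, 32]) torch.Size([4, 32])
```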
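The causal attention items above (masking, dropout, Python class definition) in one compact sketch, assuming PyTorch; `CausalSelfAttention` is a placeholder name, and this single-head version omits the multi-head weight splits.

```python
import torch
import torch.nn as nn


class CausalSelfAttention(nn.Module):
    """Single-head self-attention with a causal mask so each position
    can only attend to itself and earlier positions."""

    def __init__(self, d_in, d_out, context_length, dropout=0.1):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)
        self.dropout = nn.Dropout(dropout)
        # Upper-triangular mask marks "future" positions that must be hidden
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):                      # x: (batch, seq_len, d_in)
        b, seq_len, _ = x.shape
        queries, keys, values = self.W_query(x), self.W_key(x), self.W_value(x)
        scores = queries @ keys.transpose(1, 2)            # (b, seq_len, seq_len)
        scores = scores.masked_fill(self.mask[:seq_len, :seq_len].bool(), float("-inf"))
        weights = torch.softmax(scores / keys.shape[-1] ** 0.5, dim=-1)
        weights = self.dropout(weights)                    # dropout on attention weights
        return weights @ values                            # (b, seq_len, d_out)


attn = CausalSelfAttention(d_in=16, d_out=16, context_length=64)
out = attn(torch.randn(2, 10, 16))
print(out.shape)   # torch.Size([2, 10, 16])
```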
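A sketch of one decoding step combining temperature scaling and top-k sampling, as referenced from "Modifying the text generator" above; `sample_next_token` and its defaults are illustrative rather than the book's exact generate function.

```python
import torch


def sample_next_token(logits, temperature=1.0, top_k=None):
    """Pick the next token ID from a (vocab_size,) logits vector.

    temperature < 1 sharpens the distribution, > 1 flattens it;
    top_k keeps only the k highest-scoring tokens before sampling."""
    if top_k is not None:
        top_logits, _ = torch.topk(logits, top_k)
        # Mask everything below the k-th largest logit
        logits = torch.where(logits < top_logits[-1],
                             torch.tensor(float("-inf")), logits)
    if temperature > 0:
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1)
    return torch.argmax(logits, dim=-1, keepdim=True)   # temperature == 0: greedy


logits = torch.tensor([4.0, 1.0, 0.5, 3.5, 0.1])
print(sample_next_token(logits, temperature=0.8, top_k=2))  # index 0 or 3
```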
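A small sketch of saving and reloading model and optimizer state with PyTorch (the model file load/save items above); the file name and AdamW settings are arbitrary example choices.

```python
import torch

model = torch.nn.Linear(8, 2)                      # stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)

# Save state_dicts only, which is the portable, recommended format
torch.save({"model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict()},
           "model_and_optimizer.pth")

# Load: rebuild the objects first, then restore their parameters/state
checkpoint = torch.load("model_and_optimizer.pth",
                        map_location="cpu", weights_only=True)
model2 = torch.nn.Linear(8, 2)
model2.load_state_dict(checkpoint["model_state_dict"])
optimizer2 = torch.optim.AdamW(model2.parameters(), lr=5e-4, weight_decay=0.1)
optimizer2.load_state_dict(checkpoint["optimizer_state_dict"])
model2.eval()   # disable dropout for inference
```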

Chapter 1

Chapter 2

Chapter 3

Chapter 4

Chapter 5
- loss functions & log transformations
- Pythia
- OLMo
- Project Gutenberg for LLM training
- Simple and Scalable Pretraining Strategies
- BloombergGPT
- GaLore optimizer
- GaLore code repo
- Dolma: an Open Corpus of Three Trillion Tokens
- The Pile
- RefinedWeb Dataset for Falcon LLM
- RedPajama
- FineWeb dataset from CommonCrawl
- top-k sampling
- Beam search (not covered in chapter 5)

Chapter 6
- Finetuning Transformers
- Finetuning LLMs
- More spam classification experiments
- Binary classification using a single output node
- Imbalanced-learn user guide
- spam email classification dataset
- BERT: Pre-training of Deep Bidirectional Transformers
- RoBERTa
- IMDB Movie Reviews sentiment
- causal mask removal
- LLM2Vec
(see Jupyter notebook)
- Learning rate warmup
- Cosine decay
- Gradient clipping
- Modified training function (see the training-loop sketch below)
- Intro
- Dataset prep
- Model init
- LoRA (see the LoRA sketch below)
- Listing E.5: implementation
- Image E.3
- Listing E.6: LinearWithLoRA layer to replace Linear layers
- Image E.4: architecture
- Listing E.7: finetuning with LoRA layers
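
A sketch combining learning rate warmup, cosine decay, and gradient clipping in a toy training loop, as referenced from "Modified training function" above; the schedule is computed manually per step, and `peak_lr`, `warmup_steps`, and `total_steps` are assumed example values.

```python
import math
import torch
import torch.nn.functional as F


def lr_at_step(step, peak_lr=5e-4, min_lr=5e-5, warmup_steps=20, total_steps=1000):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))


model = torch.nn.Linear(16, 4)                       # stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

for step in range(100):                              # toy training loop
    lr = lr_at_step(step)
    for group in optimizer.param_groups:             # apply the scheduled LR
        group["lr"] = lr

    inputs = torch.randn(8, 16)
    targets = torch.randint(0, 4, (8,))
    loss = F.cross_entropy(model(inputs), targets)

    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping: cap the global gradient norm at 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```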
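A sketch of the LoRA idea behind the listings above: a frozen pretrained `Linear` layer wrapped so that a trainable low-rank update `x @ A @ B` is added to its output. The `LinearWithLoRA` name mirrors the layer named above; `rank`, `alpha`, and the `replace_linear_with_lora` helper are illustrative choices, not the book's exact code.

```python
import math
import torch
import torch.nn as nn


class LoRALayer(nn.Module):
    """Low-rank update: x @ A @ B, scaled by alpha."""

    def __init__(self, in_dim, out_dim, rank=8, alpha=16):
        super().__init__()
        self.A = nn.Parameter(torch.empty(in_dim, rank))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.B = nn.Parameter(torch.zeros(rank, out_dim))  # zero init: no change at start
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)


class LinearWithLoRA(nn.Module):
    """Frozen pretrained Linear plus a trainable LoRA correction."""

    def __init__(self, linear, rank=8, alpha=16):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)


def replace_linear_with_lora(module, rank=8, alpha=16):
    """Recursively swap every nn.Linear for a LinearWithLoRA wrapper."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, LinearWithLoRA(child, rank, alpha))
        else:
            replace_linear_with_lora(child, rank, alpha)


model = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 10))
for p in model.parameters():
    p.requires_grad = False                  # freeze the pretrained weights
replace_linear_with_lora(model)              # only the new LoRA A/B matrices train now
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)                             # counts A and B parameters only
```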