- Brief historical context (5 min)
- Evolution: FFN → RNN → LSTM → Attention
- Why direct connections matter
- Query/Key/Value framework
- Dot-product attention mathematics
- Softmax and scaling deep dive
- Worked example from start to finish (a runnable sketch follows this outline)
- Attention pattern visualization
- Masking in attention
- Why a single attention head isn't enough
- Parameter matrices breakdown (see the multi-head sketch after this outline)
- Linear projections in detail
- Head dimension calculations
- Pattern specialization per head
- Output aggregation mechanisms
- Parallel computation benefits
- Absolute vs relative positioning
- RoPE mathematics (see the rotary-embedding sketch after this outline)
- RoPE implementation details
- Benefits over other positional approaches
- Scaling considerations
- Quadratic scaling problem
- Memory and computation bounds (a back-of-envelope calculation follows this outline)
- Needle-in-a-haystack experiments
- Key metrics for long context
- Path from 2K to 100K+ tokens
- Current SOTA approaches
- Real-world implications
- Core KV caching concept
- Memory savings mathematics
- Implementation deep dive (see the KV-cache sketch after this outline)
- Autoregressive generation patterns
- Common pitfalls and solutions
- Performance implications
- IO-aware attention computation
- Tiling strategies (see the tiled online-softmax sketch after this outline)
- Memory hierarchy optimization
- FlashAttention kernel implementation details
- FlashAttention-1 vs FlashAttention-2
- Performance gains
- Local vs global attention patterns (see the sliding-window mask sketch after this outline)
- Sparse attention matrices
- Block-sparse implementations
- Longformer/BigBird approaches
- Trade-offs in long sequences
- Memory scaling problems in attention
- GQA mathematical foundation (see the grouped-query sketch after this outline)
- Head grouping strategies
- Memory vs performance tradeoffs
- Implementation considerations
- Real-world benchmarks
- Latent space fundamentals (see the latent-KV sketch after this outline)
- Dimensionality reduction mathematics
- Query/key space transformations
- Computational efficiency gains
- Implementation challenges
- Practical considerations
- Linear attention mechanisms (see the kernelized-attention sketch after this outline)
- State space models
- Mamba architecture insights
- Performance characteristics
- When to use linear alternatives
- Attention noise problem
- Dual softmax mechanism in detail (see the differential-attention sketch after this outline)
- Lambda parameter dynamics
- GroupNorm's role
- Training considerations
- Performance characteristics
- Empirical improvements
- MoE fundamentals
- Expert routing in attention (see the top-k routing sketch after this outline)
- Balancing and gating
- Training considerations
- Scaling properties
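
A minimal PyTorch sketch of the query/key/value framework, scaled dot-product scores, softmax, and causal masking from the opening items (referenced from the worked-example bullet above). The single-head, no-batch layout and the function name are illustrative choices, not a reference implementation.

```python
# Scaled dot-product attention with an optional causal mask.
# Shapes: q, k, v are (seq_len, d_k); single head, no batch dimension.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, causal=False):
    d_k = q.size(-1)
    # Similarity between every query and every key: (seq, seq)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    if causal:
        # Mask out positions j > i so a token cannot attend to the future.
        seq_len = q.size(-2)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)   # each row sums to 1: the attention pattern
    return weights @ v, weights

# Tiny worked example: 4 tokens, d_k = 8
torch.manual_seed(0)
q, k, v = (torch.randn(4, 8) for _ in range(3))
out, attn = scaled_dot_product_attention(q, k, v, causal=True)
print(attn)          # lower-triangular attention pattern
print(out.shape)     # torch.Size([4, 8])
```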
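
A sketch of the multi-head items (parameter matrices, linear projections, head-dimension split, output aggregation), assuming the common d_model = num_heads × head_dim layout; the hyperparameters are placeholders.

```python
# Multi-head attention: one linear projection each for Q/K/V, reshaped into
# num_heads heads of size head_dim = d_model // num_heads, attended
# independently, then concatenated and mixed by an output projection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads           # e.g. 512 / 8 = 64
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)      # output aggregation

    def forward(self, x):
        b, t, d = x.shape
        def split(proj):  # (b, t, d) -> (b, heads, t, head_dim)
            return proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.q_proj), split(self.k_proj), split(self.v_proj)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        attn = F.softmax(scores, dim=-1)               # each head learns its own pattern
        out = attn @ v                                  # (b, heads, t, head_dim)
        out = out.transpose(1, 2).reshape(b, t, d)      # concatenate heads
        return self.o_proj(out)

x = torch.randn(2, 16, 512)
print(MultiHeadAttention()(x).shape)   # torch.Size([2, 16, 512])
```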
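
A sketch of the RoPE items. The rotate-pairs layout (first half of the channels paired with the second half) and the base of 10000 follow common open-source implementations, but conventions vary between codebases.

```python
# Rotary position embeddings: each channel pair is rotated by an angle that
# grows linearly with position, so the dot product between a rotated query at
# position m and a rotated key at position n depends only on the offset m - n.
import torch

def rope(x, base=10000.0):
    # x: (seq_len, d) with d even; returns x with the rotary embedding applied.
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)          # theta_i
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1_i, x2_i) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(6, 8)
k = torch.randn(6, 8)
print(rope(q).shape)   # torch.Size([6, 8]) — apply to q and k before taking scores
```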
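
A back-of-envelope calculation for the quadratic-scaling items: the size of the materialized attention-score matrix per layer. The 32 heads and 16-bit activations are assumptions for illustration, not any particular model's configuration.

```python
# The score matrix is (seq_len x seq_len) per head, so memory for materialized
# scores grows quadratically with context length.
def score_matrix_gib(seq_len, num_heads=32, bytes_per_elem=2):  # fp16/bf16
    return seq_len * seq_len * num_heads * bytes_per_elem / 2**30

for n in (2_000, 8_000, 32_000, 128_000):
    print(f"{n:>7} tokens -> {score_matrix_gib(n):8.1f} GiB of scores per layer")
```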
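
A sketch of the KV-caching items: keys and values for the prefix are appended to a cache and reused, so each decode step attends with a single new query at O(t) cost instead of recomputing the full prefix. The dict-based cache and single-layer, single-head layout are simplifications.

```python
# KV caching during autoregressive decoding.
import torch
import torch.nn.functional as F

def decode_step(x_new, w_q, w_k, w_v, cache):
    # x_new: (1, d_model) — hidden state of the newly generated token.
    q = x_new @ w_q
    cache["k"] = torch.cat([cache["k"], x_new @ w_k], dim=0)   # append, never recompute
    cache["v"] = torch.cat([cache["v"], x_new @ w_v], dim=0)
    scores = q @ cache["k"].T / q.size(-1) ** 0.5              # (1, t) — linear per step
    return F.softmax(scores, dim=-1) @ cache["v"]

d = 64
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
cache = {"k": torch.empty(0, d), "v": torch.empty(0, d)}
for step in range(5):                                          # pretend 5 decode steps
    out = decode_step(torch.randn(1, d), w_q, w_k, w_v, cache)
print(cache["k"].shape)                                        # torch.Size([5, 64])
# Cache memory per layer ≈ 2 (K and V) * num_heads * head_dim * seq_len * bytes_per_elem.
```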
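
A sketch of the tiling idea behind the FlashAttention items, written as plain PyTorch over key/value blocks with an online softmax. The real kernels fuse this loop in on-chip SRAM and handle the backward pass; this shows the algorithm's shape, not its performance.

```python
# Tiled attention with an online softmax: iterate over key/value blocks, keep a
# running row-wise max and running denominator, and never materialize the full
# (seq x seq) score matrix.
import torch

def tiled_attention(q, k, v, block=64):
    t, d = q.shape
    scale = d ** -0.5
    m = torch.full((t,), float("-inf"))      # running row-wise max
    l = torch.zeros(t)                       # running softmax denominator
    acc = torch.zeros(t, d)                  # running (unnormalized) output
    for start in range(0, k.shape[0], block):
        k_blk, v_blk = k[start:start + block], v[start:start + block]
        s = q @ k_blk.T * scale              # (t, block) — only one tile in memory
        m_new = torch.maximum(m, s.max(dim=-1).values)
        corr = torch.exp(m - m_new)          # rescale previously accumulated sums
        p = torch.exp(s - m_new[:, None])
        l = l * corr + p.sum(dim=-1)
        acc = acc * corr[:, None] + p @ v_blk
        m = m_new
    return acc / l[:, None]

q = k = v = torch.randn(256, 64)
ref = torch.softmax(q @ k.T / 64 ** 0.5, dim=-1) @ v
print(torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4))  # True
```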
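
A sketch of the local-attention items: a sliding-window mask of the kind Longformer-style models use. The window size is arbitrary here, and global tokens are omitted.

```python
# Sliding-window (local) attention mask: each token attends only to neighbors
# within `window` positions, so the number of allowed scores grows linearly
# in seq_len instead of quadratically.
import torch

def sliding_window_mask(seq_len, window):
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    return (j - i).abs() <= window            # True where attention is allowed

mask = sliding_window_mask(seq_len=8, window=2)
print(mask.int())
# A block-sparse kernel only computes scores inside the banded region; applying
# this mask densely (scores.masked_fill(~mask, float("-inf"))) gives the same
# math without the speed or memory benefit.
```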
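
A sketch of the GQA items: several query heads share each key/value head, so the KV cache shrinks by the grouping factor while the query-side capacity is kept. The head counts are illustrative.

```python
# Grouped-query attention: broadcast each KV head to its group of query heads.
import torch
import torch.nn.functional as F

def gqa(q, k, v):
    # q: (num_q_heads, t, d); k, v: (num_kv_heads, t, d); num_q_heads % num_kv_heads == 0
    group = q.shape[0] // k.shape[0]
    k = k.repeat_interleave(group, dim=0)     # each KV head serves `group` query heads
    v = v.repeat_interleave(group, dim=0)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(8, 16, 64)                    # 8 query heads
k = torch.randn(2, 16, 64)                    # 2 KV heads -> 4x smaller KV cache
v = torch.randn(2, 16, 64)
print(gqa(q, k, v).shape)                     # torch.Size([8, 16, 64])
```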
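
A heavily simplified sketch of the latent-attention items: hidden states are compressed to a low-rank latent, only the latent is cached, and per-head keys/values are reconstructed from it with up-projections. Dimensions and module names are placeholders; real multi-head latent attention designs additionally keep a separate rotary-embedding path and fold projections together at inference, which this sketch omits.

```python
# Latent-KV compression: cache a small latent per token instead of full K/V.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_latent, n_heads, d_head = 512, 64, 8, 64
down = nn.Linear(d_model, d_latent, bias=False)           # cached: (t, 64), not (t, 1024)
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)
w_q  = nn.Linear(d_model, n_heads * d_head, bias=False)

x = torch.randn(16, d_model)
latent = down(x)                                           # low-rank KV cache entry
q = w_q(x).view(16, n_heads, d_head).transpose(0, 1)       # (heads, t, d_head)
k = up_k(latent).view(16, n_heads, d_head).transpose(0, 1)
v = up_v(latent).view(16, n_heads, d_head).transpose(0, 1)
attn = F.softmax(q @ k.transpose(-2, -1) / d_head ** 0.5, dim=-1)
print((attn @ v).shape)                                    # torch.Size([8, 16, 64])
print(latent.shape)                                        # torch.Size([16, 64]) is cached
```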
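
A sketch of the linear-attention item: the softmax kernel is replaced with a positive feature map (elu + 1, following the "Transformers are RNNs" formulation), so all keys and values are summarized in a d×d matrix and cost grows linearly in sequence length. The causal variant turns that summary into a running state, which is the bridge to state-space models such as Mamba; this non-causal version is shown for brevity.

```python
# Kernelized linear attention: phi(Q) (phi(K)^T V) instead of softmax(QK^T) V,
# computed in O(t * d^2) rather than O(t^2 * d).
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    phi_q = F.elu(q) + 1                      # positive feature map
    phi_k = F.elu(k) + 1
    kv = phi_k.transpose(-2, -1) @ v          # (d, d) summary of all keys/values
    z = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-2, -1)   # (t, 1) normalizer
    return (phi_q @ kv) / z

q, k, v = (torch.randn(1024, 64) for _ in range(3))
print(linear_attention(q, k, v).shape)        # torch.Size([1024, 64])
```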
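
A simplified sketch of the differential-attention items: two softmax maps built from two halves of the query/key projections are subtracted with a learned λ, which cancels common-mode attention noise. The scalar λ stands in for the paper's reparameterized per-head λ, and the per-head GroupNorm and output rescaling are omitted to keep the sketch short, so treat it as the core idea only.

```python
# Differential attention: subtract two softmax attention maps weighted by lambda.
import torch
import torch.nn.functional as F

def diff_attention(q1, q2, k1, k2, v, lam):
    d = q1.size(-1)
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d ** 0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d ** 0.5, dim=-1)
    return (a1 - lam * a2) @ v                # differential attention map

t, d = 16, 64
q1, q2, k1, k2 = (torch.randn(t, d) for _ in range(4))
v = torch.randn(t, 2 * d)                     # doubled value dim, per the paper's layout
out = diff_attention(q1, q2, k1, k2, v, lam=0.8)
print(out.shape)                              # torch.Size([16, 128])
```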
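
A sketch of the MoE items: a top-k router scores experts per token, only the selected experts run, and their outputs are combined with renormalized gate weights. The experts here are plain linear layers and the loop is written for clarity rather than efficiency; load-balancing losses and attention-specific expert designs are omitted.

```python
# Top-k expert routing with a learned gate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    def __init__(self, d_model=256, num_experts=8, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))
        self.k = k

    def forward(self, x):                       # x: (tokens, d_model)
        logits = self.gate(x)                   # (tokens, num_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):              # dense loop for clarity, not efficiency
            for e, expert in enumerate(self.experts):
                routed = idx[:, slot] == e      # tokens sent to expert e in this slot
                if routed.any():
                    out[routed] += weights[routed, slot, None] * expert(x[routed])
        return out

x = torch.randn(32, 256)
print(TopKRouter()(x).shape)                    # torch.Size([32, 256])
```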