This is a reading list of papers/videos/repos I've personally found useful while ramping up on ML Systems, and that I wish more people would just sit and study carefully during their work hours. If you're looking for more recommendations, go through the citations of the papers below, or browse the proceedings of the conferences where MLSys papers get published, and enjoy!
- Attention Is All You Need: Start here; still one of the best intros
- Online normalizer calculation for softmax: A must-read before Flash Attention; it will help you get the main "trick" (see the online-softmax sketch after this list)
- Self-attention Does Not Need O(n^2) Memory: Applies the same running-normalizer idea to compute attention blockwise without materializing the full score matrix
- Flash Attention 2: The diagrams here do a better job of explaining Flash Attention 1 as well
- Llama 2 paper: Skim it for the model details
- gpt-fast: A great repo to come back to for minimal yet performant code
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation: There are tons of papers on long context lengths, but I found this to be among the clearest (see the ALiBi sketch after this list)
- Google the different kinds of attention: cosine, dot product, cross, local, sparse, convolutional
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems: Wonderful survey, start here
- Efficiently Scaling Transformer Inference: Introduced many ideas, most notably KV caches
- Making Deep Learning go Brrr from First Principles: One of the best intros to fusions and overhead
- Fast Inference from Transformers via Speculative Decoding: This is the paper that helped me grok the difference in performance characteristics between prefill and autoregressive decoding (see the speculative-decoding sketch after this list)
- Grouped Query Attention: KV caches can be chunky; this is how you fix it (see the GQA sketch after this list)
- Orca: A Distributed Serving System for Transformer-Based Generative Models: introduced continuous batching (great pre-read for the PagedAttention paper).
- Efficient Memory Management for Large Language Model Serving with PagedAttention: The most crucial optimization for high-throughput batch inference (see the block-table sketch after this list)
- Colfax Research Blog: Excellent blog if you're interested in learning more about CUTLASS and modern GPU programming
- Sarathi LLM: Introduces chunked prefill to make workloads more balanced between prefill and decode
- Epilogue Visitor Tree: Fuse custom epilogues by composing them in the same class (visitor design pattern) and representing the whole epilogue as a tree
- A White Paper on Neural Network Quantization: Start here; this will give you the foundation to quickly skim all the other quantization papers
- LLM.int8(): All of Dettmers' papers are great, but this is a natural intro (see the outlier-decomposition sketch after this list)
- FP8 formats for deep learning: For a first-hand look at how new number formats come about
- SmoothQuant: Balancing quantization difficulty between weights and activations (see the scale-migration sketch after this list)
- Mixed Precision Training: The OG paper describing mixed-precision training strategies for half precision
- RoFormer: Enhanced Transformer with Rotary Position Embedding: The paper that introduced rotary positional embeddings (see the RoPE sketch after this list)
- YaRN: Efficient Context Window Extension of Large Language Models: Extend a base model's context length with finetuning
- Ring Attention with Blockwise Transformers for Near-Infinite Context: Scale to infinite context lengths as long as you can stack more GPUs
- VENOM: A vectorized N:M format for sparse tensor cores when the hardware only supports 2:4 sparsity
- Megablocks: Efficient Sparse training with mixture of experts
- ReLu Strikes Back: Really enjoyed this paper as an example of doing model surgery for more efficient inference
- Singularity: Shows how to make jobs preemptible, migratable and elastic
- Local SGD: So hot right now
- OpenDiLoCo: Open-source framework for decentralized, low-communication training
- torchtitan: Minimal repository showing how to implement 4D parallelism in pure PyTorch
- PipeDream: The pipeline parallelism paper
- Just-in-time checkpointing: A very clever alternative to periodic checkpointing
- Reducing Activation Recomputation in Large Transformer Models: The paper that introduced selective activation checkpointing and goes over activation recomputation strategies
- Breaking the computation and communication abstraction barrier: God tier paper that goes over research at the intersection of distributed computing and compilers to maximize comms overlap
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models: The ZeRO algorithm behind FSDP and DeepSpeed, which shards optimizer states, gradients, and parameters to intelligently reduce data-parallel memory usage
- Megatron-LM: For an introduction to Tensor Parallelism
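
To close out, here are a few toy PyTorch sketches of recurring tricks from the list above. They are my own illustrative code under stated assumptions, not the papers' reference implementations.

First, the online-softmax "trick": keep a running max `m` and running normalizer `d`, rescaling `d` whenever a new max appears, so softmax needs only one pass over the scores.

```python
import torch

def online_softmax(scores: torch.Tensor) -> torch.Tensor:
    """One-pass softmax with a running max and rescaled normalizer."""
    m = torch.tensor(float("-inf"))  # running max
    d = torch.tensor(0.0)            # running normalizer sum(exp(x - m))
    for x in scores:
        m_new = torch.maximum(m, x)
        # rescale the old normalizer to the new max, then add the new term
        d = d * torch.exp(m - m_new) + torch.exp(x - m_new)
        m = m_new
    return torch.exp(scores - m) / d

x = torch.randn(8)
assert torch.allclose(online_softmax(x), torch.softmax(x, dim=0))
```

Flash Attention applies the same rescaling to blocks of the attention matrix, which is why it never has to materialize the full n x n score matrix.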
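A minimal sketch of ALiBi, assuming a power-of-two head count (the slope schedule below is the paper's geometric sequence for that case): instead of positional embeddings, each head adds a distance-proportional penalty to its attention scores.

```python
import torch

def alibi_bias(n: int, num_heads: int) -> torch.Tensor:
    """(num_heads, n, n) additive bias: slope * (j - i) for keys j <= i."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    rel = torch.arange(n)[None, :] - torch.arange(n)[:, None]  # j - i
    # causal attention only looks backwards, so clamp future positions to 0
    return slopes[:, None, None] * rel.clamp(max=0)

# usage: scores = q @ k.transpose(-2, -1) / d**0.5 + alibi_bias(t, heads)
```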
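A greedy sketch of speculative decoding (the paper uses rejection sampling to exactly match the target distribution; greedy argmax keeps the sketch short, and `draft`/`target` are any callables mapping `(1, t)` token ids to `(1, t, vocab)` logits). The point: verifying k drafted tokens is one prefill-style parallel pass through the big model instead of k serial decode steps.

```python
import torch

@torch.no_grad()
def speculative_step(draft, target, ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    # 1) the cheap model autoregressively drafts k tokens
    drafted = ids
    for _ in range(k):
        nxt = draft(drafted)[:, -1:].argmax(-1)
        drafted = torch.cat([drafted, nxt], dim=-1)
    # 2) the expensive model scores every draft in ONE parallel forward pass
    verify = target(drafted)[:, ids.shape[1] - 1 : -1].argmax(-1)  # (1, k)
    guesses = drafted[:, ids.shape[1]:]
    # 3) keep the longest agreeing prefix plus the target's correction token
    #    (the "bonus" token when all k drafts agree is omitted for brevity)
    n_ok = (verify == guesses).int().cumprod(-1).sum().item()
    return torch.cat([ids, guesses[:, :n_ok], verify[:, n_ok:n_ok + 1]], dim=-1)

# toy usage with stand-in "models" (random logits, just to exercise the shapes)
fake = lambda ids: torch.randn(1, ids.shape[1], 100)
print(speculative_step(fake, fake, torch.randint(100, (1, 5))).shape)
```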
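GQA in a few lines: several query heads share one KV head, so the KV cache shrinks by the group size. The shapes below are assumptions for illustration, with `groups` = query heads per KV head.

```python
import torch
import torch.nn.functional as F

def gqa(q, k, v, groups: int):
    """q: (b, h_q, t, d); k, v: (b, h_q // groups, t, d)."""
    k = k.repeat_interleave(groups, dim=1)  # broadcast each KV head to its group
    v = v.repeat_interleave(groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(2, 8, 16, 64)
k = torch.randn(2, 2, 16, 64)  # 4x fewer KV heads -> 4x smaller KV cache
v = torch.randn(2, 2, 16, 64)
out = gqa(q, k, v, groups=4)   # (2, 8, 16, 64)
```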
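The bookkeeping behind PagedAttention, as a toy block-table allocator (my own sketch, not vLLM's actual API): KV storage is a pool of fixed-size blocks, and each sequence holds a block table mapping logical positions to physical blocks, so the cache never needs a contiguous reservation and freed blocks are immediately reusable.

```python
class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))     # free physical blocks
        self.tables: dict[int, list[int]] = {}  # seq id -> block table
        self.lengths: dict[int, int] = {}       # seq id -> tokens written

    def slot_for_next_token(self, seq: int) -> tuple[int, int]:
        """Physical (block, offset) where the next token's K/V vectors go."""
        table = self.tables.setdefault(seq, [])
        pos = self.lengths.get(seq, 0)
        if pos % self.block_size == 0:     # current block full (or first token)
            table.append(self.free.pop())  # allocate on demand, one block at a time
        self.lengths[seq] = pos + 1
        return table[pos // self.block_size], pos % self.block_size

    def release(self, seq: int) -> None:
        self.free.extend(self.tables.pop(seq, []))  # instantly reusable
        self.lengths.pop(seq, None)

cache = PagedKVCache(num_blocks=64)
for _ in range(40):  # 40 tokens only ever hold ceil(40/16) = 3 blocks
    block, offset = cache.slot_for_next_token(seq=0)
cache.release(seq=0)
```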
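The outlier decomposition at the heart of LLM.int8(), sketched with made-up shapes and a float matmul standing in for the real int8 kernel: feature columns containing outliers stay in high precision, everything else goes through vector-wise absmax int8 quantization.

```python
import torch

def int8_matmul_with_outliers(x: torch.Tensor, w: torch.Tensor, thresh: float = 6.0):
    outlier = (x.abs() > thresh).any(dim=0)  # outlier feature dimensions of x
    # int8 path: per-row scales for activations, per-column scales for weights
    xs = x[:, ~outlier].abs().amax(1, keepdim=True).clamp(min=1e-8) / 127
    ws = w[~outlier].abs().amax(0, keepdim=True).clamp(min=1e-8) / 127
    xq = (x[:, ~outlier] / xs).round()  # real kernels store these as int8
    wq = (w[~outlier] / ws).round()
    int8_part = (xq @ wq) * (xs * ws)   # dequantize via the outer product of scales
    # high-precision path for the few outlier dimensions
    return int8_part + x[:, outlier] @ w[outlier]

x = torch.randn(4, 64)
x[:, 3] *= 20                     # plant an outlier column
w = torch.randn(64, 32)
print((int8_matmul_with_outliers(x, w) - x @ w).abs().max())  # small error
```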
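SmoothQuant's scale migration as a math identity (alpha = 0.5 is the paper's default; in the real method the division by `s` is folded into the preceding layer rather than done at runtime): dividing activations by a per-channel scale and multiplying it into the weights leaves the output unchanged but tames activation outliers.

```python
import torch

def smooth(x: torch.Tensor, w: torch.Tensor, alpha: float = 0.5):
    """y = (x / s) @ (s * w) == x @ w, with s chosen per input channel."""
    s = x.abs().amax(0) ** alpha / w.abs().amax(1) ** (1 - alpha)
    return x / s, w * s[:, None]

x = torch.randn(32, 64) * (torch.rand(64) * 10)  # outlier-heavy channels
w = torch.randn(64, 128)
x_s, w_s = smooth(x, w)
assert torch.allclose(x_s @ w_s, x @ w, atol=1e-3)
print(x.abs().max(), x_s.abs().max())  # activations are now much tamer
```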
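Rotary embeddings (RoPE) in the paper's interleaved-pair convention (note many codebases, e.g. Llama-style ones, rotate half-dimensions instead): each (even, odd) feature pair of q and k is rotated by a position-dependent angle, so dot products depend only on relative position.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (..., seq, dim) with even dim; rotates feature pairs by pos * theta_i."""
    seq, dim = x.shape[-2], x.shape[-1]
    theta = base ** (-torch.arange(0, dim, 2) / dim)   # (dim/2,) frequencies
    ang = torch.arange(seq)[:, None] * theta[None, :]  # (seq, dim/2) angles
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin               # 2D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 8, 16, 64)  # applied to queries and keys, never values
print(rope(q).shape)
```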