Paper List for Machine Learning Systems

Paper list for broad topics in machine learning systems

NOTE: Survey papers are annotated with [Survey 🔍] prefix.

1. Data Processing

1.1 Data pipeline optimization

1.1.1 General

[HotInfra'24] Lotus: Characterize Architecture Level CPU-based Preprocessing in Machine Learning Pipelines
[arxiv'24] TensorSocket: Shared Data Loading for Deep Learning Training
[arxiv'24] Efficient Tabular Data Preprocessing of ML Pipelines
[arxiv'24] cedar: Composable and Optimized Machine Learning Input Data Pipelines
[MLSys'22] Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines
[ISCA'22] Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
[SIGMOD'22] Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines
[VLDB'21] Analyzing and Mitigating Data Stalls in DNN Training
[VLDB'21] tf.data: A Machine Learning Data Processing Framework

1.1.2 Prep stalls

[ATC'24] Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement
[HotStorage'24] A Selective Preprocessing Offloading Framework for Reducing Data Traffic in DL Training
[VLDB'24] FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation
[arxiv'23] Rinas: Training with Dataset Shuffling Can Be General and Fast
[CVPR'23] FFCV: Accelerating Training by Removing Data Bottlenecks
[RecSys'23] InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models
[SIGMOD'23] GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning
[VLDB'23] FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline
[SoCC'23] tf.data service: A Case for Disaggregating ML Input Data Processing
- arxiv version
[ATC'22] Cachew: Machine Learning Input Data Processing as a Service
[OSDI'22] Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters
[ICPP'19] DLBooster: Boosting End-to-End Deep Learning Workflows with Offloading Data Preprocessing Pipelines

1.1.3 Fetch stalls (I/O)

[TACO'23] Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training
[ICPP'22] Lobster: Load Balance-Aware I/O for Distributed DNN Training
[SC'21] Clairvoyant Prefetching for Distributed Machine Learning I/O

1.1.4 Specific workloads (GNN, DLRM)

[VLDB'25] Eliminating Data Processing Bottlenecks in GNN Training over Large Graphs via Two-level Feature Compression
[ISCA'24] PreSto: An In-Storage Data Preprocessing System for Training Recommendation Models
[arxiv'23] Towards Data-centric Graph Machine Learning: Review and Outlook
[arxiv'23] FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models
[MLSys'23] RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure
[ASPLOS'22] RecShard: statistical feature-based memory optimization for industry-scale neural recommendation
[RecSys'23] InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models
[arxiv'23] MTrainS: Improving DLRM training efficiency using heterogeneous memories
[SOSP'23] Bagpipe: Accelerating Deep Recommendation Model Training
[SOSP'23] gSampler: General and Efficient GPU-based Graph Sampling for Graph Learning
[NSDI'23] BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing
[DAC'22] A Joint Management Middleware to Improve Training Performance of Deep Recommendation Systems with SSDs
[VLDB'22] Accelerating Recommendation System Training by Leveraging Popular Choices

1.2 Caching and Distributed storage for ML training

[TPDS'23] High-Level Data Abstraction and Elastic Data Caching for Data-Intensive AI Applications on Cloud-Native Platforms
[SOSP'23] UGACHE: A Unified GPU Cache for Embedding-based Deep Learning
[ATC'23] Tectonic-Shift: A Composite Storage Fabric for Large-Scale ML Training
[EuroSys'23] SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters [also in 2.2]
[FAST'23] SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training
[HPCA'23] iCACHE: An Importance-Sampling-Informed Cache for Accelerating I/O-Bound DNN Model Training
[NeurIPS'22] A Deep Learning Dataloader with Shared Data Preparation
[CLUSTER'22] Hvac: Removing I/O Bottleneck for Large-Scale Deep Learning Applications
[ICDE'22] Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs
[ATC'21] Refurbish Your Training Data: Reusing Partially Augmented Samples for Faster Deep Neural Network Training
[FAST'20] Quiver: An Informed Storage Cache for Deep Learning
[ICPP'20] DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training
[arxiv'19] Faster Neural Network Training with Data Echoing
[HotCloud'19] The Case for Unifying Data Loading in Machine Learning Clusters

1.3 Data formats

[ECCV'22] L3: Accelerator-Friendly Lossless Image Format for High-Resolution, High-Throughput DNN Training
[VLDB'21] Progressive compressed records: Taking a byte out of deep learning data

1.4 Data pipeline fairness and correctness

[CIDR'21] Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines

1.5 Data labeling automation

[VLDB'18] Snorkel: Rapid Training Data Creation with Weak Supervision

2. Training System

2.1 Empirical study on ML Jobs

[ICSE'24] An Empirical Study on Low GPU Utilization of Deep Learning Jobs
[NSDI'24] Characterization of Large Language Model Development in the Datacenter
[NSDI'22] MLaaS in the wild: workload analysis and scheduling in large-scale heterogeneous GPU clusters (PAI)
[ATC'19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (Philly)

2.2 DNN job scheduling

[SoCC'24] Kale: Elastic GPU Scheduling for Online DL Model Training
[arxiv'24] Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
[SC'24] PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters
[OSDI'24] MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale
[ASPLOS'24] Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters
[Middleware'24] Optimal Resource Efficiency with Fairness in Heterogeneous GPU Clusters
[IPDPS'24] Hadar: Heterogeneity-Aware Optimization-Based Online Scheduling for Deep Learning Cluster
[EuroSys'24] Blox: A Modular Toolkit for Deep Learning Schedulers
[NSDI'24] Swing: Short-cutting Rings for Higher Bandwidth Allreduce
[NSDI'24] Towards Domain-Specific Network Transport for Distributed DNN Training
[NSDI'24] Vulcan: Automatic Query Planning for Live ML Analytics
[NSDI'24] CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters
[Survey 🔍] [ACM CSUR'23] Deep Learning Workload Scheduling in GPU Datacenters: A Survey
[arxiv'23] Energy-Efficient GPU Clusters Scheduling for Deep Learning
[SC'23] EasyScale: Accuracy-consistent Elastic Training for Deep Learning
[ICPP'23] CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel
[ICPP'23] Embracing Uncertainty for Equity in Resource Allocation in ML Training
[SOSP'23] Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling
[NSDI'23] Shockwave: Proactive, Fair, and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning
[EuroSys'23] SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters [also in 1.2]
[EuroSys'23] Lyra: Elastic Scheduling for Deep Learning Clusters
[EuroSys'23] ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning
[ASPLOS'23] Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs
[arxiv'22] Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads
[Survey 🔍] [arxiv'22] Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision
[SoCC'22] ESCHER: Expressive Scheduling with Ephemeral Resources
[NSDI'22] MLaaS in the wild: workload analysis and scheduling in large-scale heterogeneous GPU clusters (PAI)
[OSDI'22] Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters (Synergy)
[SIGCOMM'22] Multi-resource interleaving for deep learning training (Muri)
[MLSys'21] Wavelet: Efficient DNN Training with Tick-Tock Scheduling
[SoCC'21] Chronus: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs
[SC'21] Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters (Helios)
[OSDI'21] Privacy Budget Scheduling (DPF)
[NSDI'21] Elastic Resource Sharing for Distributed Deep Learning (AFS)
[OSDI'21] Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
[EuroSys'20] Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning (GandivaFair)
[NSDI'20] Themis: Fair and Efficient GPU Cluster Scheduling
[OSDI'20] HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees
[OSDI'20] Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads (Gavel)
[EuroSys'20] AlloX: Compute Allocation in Hybrid Clusters
[MLSys'20] Resource Elasticity in Distributed Deep Learning
[NSDI'19] Tiresias: A GPU Cluster Manager for Distributed Deep Learning
[ATC'19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (Philly)
[EuroSys'18] Optimus: an efficient dynamic resource scheduler for deep learning clusters
[OSDI'18] Gandiva: Introspective Cluster Scheduling for Deep Learning

2.3 GPU sharing

[SC'24] ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments
[arxiv'24] Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads
[arxiv'24] Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference
[ICPP'24] MIGER: Integrating Multi-Instance GPU and Multi-Process Service for Deep Learning Clusters
[ASPLOS'24] RAP: Resource-aware Automated GPU Sharing for Multi-GPU Recommendation Model Training and Input Preprocessing
[EuroSys'24] Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications
[ATC'23] Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent
[NSDI'23] Transparent GPU Sharing in Container Clouds for Deep Learning Workloads
[ICPP'23] FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference
[arxiv'23] GACER: Granularity-Aware ConcurrEncy Regulation for Multi-Tenant Deep Learning
[arxiv'23] MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters
[SoCC'22] MISO: exploiting multi-instance GPU capability on multi-tenant GPU clusters
[PACT'22] GPUPool: A Holistic Approach to Fine-Grained GPU Sharing in the Cloud
[ATC'21] Zico: Efficient GPU Memory Sharing for Concurrent DNN Training
[MLSys'20] Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications
[OSDI'20] AntMan: Dynamic Scaling on GPU Clusters for Deep Learning
[OSDI'20] PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications
[RTAS'19] Fractional GPUs: Software-Based Compute and Memory Bandwidth Reservation for GPUs

2.4 GPU memory management and optimization

[arxiv'24] Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator
[TACO'24] ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management
[arxiv'24] GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
[ASPLOS'24] GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching
[arxiv'23] Rethinking Memory and Communication Cost for Efficient Large Language Model Training
[arxiv'23] Quantized Distributed Training of Large Models with Convergence Guarantees (QSDP)
[arxiv'23] Does compressing activations help model parallel training?
[SoCC'23] Towards GPU Memory Efficiency for Distributed Training at Scale
[VLDB'23] PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
[SOSP'23] Efficient Memory Management for Large Language Model Serving with PagedAttention
[HPCA'23] MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers via Memory-Saving Inter-Operator Parallelism
[HPCA'23] Tensor Movement Orchestration in Multi-GPU Training Systems
[IJCAI'23] OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning
[ICLR'22] LoRA: Low-Rank Adaptation of Large Language Models
- algorithmic method for memory efficiency
[VLDB'22] Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers
[ATC'21] ZeRO-Offload: Democratizing Billion-Scale Model Training
[ICLR'21] ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training
[ICLR'21] Dynamic Tensor Rematerialization
[SC'21] ZeRO-Infinity: breaking the GPU memory wall for extreme scale deep learning
[HPCA'21] Sentinel: Efficient Tensor Migration and Allocation on Heterogeneous Memory Systems for Deep Learning
[MLSys'20] Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization
[ASPLOS'20] Capuchin: Tensor-based GPU Memory Management for Deep Learning
[ASPLOS'20] SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping
[SC'20] ZeRO: memory optimizations toward training trillion parameter models
[ISCA'18] Gist: Efficient Data Encoding for Deep Neural Network Training
[PPoPP'18] Superneurons: dynamic GPU memory management for training deep neural networks
[MICRO'16] vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design
[arxiv'16] Training Deep Nets with Sublinear Memory Cost

2.5 GPU memory usage estimate

[ESEC/FSE'20] Estimating GPU memory consumption of deep learning models

2.6 Distributed training (Parallelism)

[ASPLOS'25 (to appear)] GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism
[SOSP'24] Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor
[SOSP'24] Enabling Parallelism Hot Switching for Efficient Training of Large Language Models
[arxiv'24] Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator
[TACO'24] ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management
[NeurIPS'24] Rethinking Memory and Communication Costs for Efficient Data Parallel Training of Large Language Models
[NeurIPS'24] SpeedLoader: An I/O efficient scheme for heterogeneous and distributed LLM operation
[SC'24] Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching
[SC'24] Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers
[arxiv'24] BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training
[arxiv'24] Cephalo: Harnessing Heterogeneous GPU Clusters for Training Transformer Models
[SoCC'24] Distributed training of large language models on AWS Trainium
[arxiv'24] SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile
[TPDS'24] AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost
[arxiv'24] Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization
[arxiv'24] FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression
[arxiv'24] FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training
[arxiv'24] TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training
[arxiv'24] PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training
[arxiv'24] Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters
[arxiv'24] DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models
[SOSP'24] TENPLEX: Changing Resources of Deep Learning Jobs using Parallelizable Tensor Collections
[arxiv'24] Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management
[arxiv'24] FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment
[arxiv'24] PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation
[arxiv'24] Unicron: Economizing Self-Healing LLM Training at Scale
[arxiv'24] TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading
[ICPP'24] AutoPipe: Automatic Configuration of Pipeline Parallelism in Shared GPU Cluster
[arxiv'24] Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
[Survey 🔍] [arxiv'24] Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
[COLM'24] LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers
[OSDI'24] nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training
- [arxiv'23] SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction
[ATC'24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism
[ATC'24] Metis: Fast Automatic Distributed Training on Heterogeneous GPUs
[ATC'24] FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences
[ATC'24] OPER: Optimality-Guided Embedding Table Parallelization for Large-scale Recommendation Model
[arxiv'24] LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism
[arxiv'24] PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning
[HPDC'24] DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models
[ICML'24] Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training
[ICML'24] Integrated Hardware Architecture and Device Placement Search
[MLSys'24] Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping
[MobiCom'24] Asteroid: Resource-Efficient Hybrid Pipeline Parallelism for Collaborative DNN Training on Heterogeneous Edge Devices
[EuroSys'24] ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling
[EuroMLSys@EuroSys'24] ML Training with Cloud GPU Shortages: Is Cross-Region the Answer?
[ASPLOS'24] AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning
[ASPLOS'24] PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training
[EuroSys'24] Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation
[arxiv'24] BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences
[arxiv'24] Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
[arxiv'24] Accelerating Heterogeneous Tensor Parallelism via Flexible Workload Control
[arxiv'24] GRAWA: Gradient-based Weighted Averaging for Distributed Training of Deep Learning Models
[ICLR'24] Zero Bubble (Almost) Pipeline Parallelism
[arxiv'24] BitDelta: Your Fine-Tune May Only Be Worth One Bit
[arxiv'24] NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models
[arxiv'24] Accelerating Parallel Sampling of Diffusion Models
[arxiv'24] Training DNN Models over Heterogeneous Clusters with Optimal Performance
[TKDE'24] Improving Automatic Parallel Training via Balanced Memory Workload Optimization
- extended version of Galvatron (VLDB'23)
- arxiv version (2023)
[NSDI'24] DISTMM: Accelerating Distributed Multi-modal Model Training
[NSDI'24] Accelerating Neural Recommendation Training with Embedding Scheduling
[NSDI'24] Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer
[NSDI'24] QuickUpdate: a Real-Time Personalization System for Large-Scale Recommendation Models
[NSDI'24] Scaling Large Language Model Training to More Than 10,000 GPUs
[arxiv'24] Breaking MLPerf Training: A Case Study on Optimizing BERT
[ICLR'24] CO2: Efficient Distributed Training with Full Communication-Computation Overlap
- arxiv openreview
[arxiv'24] LocMoE: A Low-overhead MoE for Large Language Model Training
[arxiv'24] Re-evaluating the Memory-balanced Pipeline Parallelism: BPipe
[AAMAS'24] Holonic Learning: A Flexible Agent-based Distributed Machine Learning Framework
[arxiv'24] InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding
[VLDB'24] Saturn: An Optimized Data System for Multi-Large-Model Deep Learning Workloads
[HPCA'24] Tessel: Boosting Distributed Execution of Large DNN Models via Flexible Schedule Search
[NSDI'24] Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
[EuroSys'24] HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis
[ICPP'23] Mercury: Fast and Optimal Device Placement for Large Deep Learning Models
[arxiv'23] vTrain: A Simulation Framework for Evaluating Cost-effective and Compute-optimal Large Language Model Training
[arxiv'23] ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a Single GPU
[arxiv'23] FlexModel: A Framework for Interpretability of Distributed Large Language Models
[arxiv'23] Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
[arxiv'23] RTP: Rethinking Tensor Parallelism with Memory Deduplication
[arxiv'23] FP8-LM: Training FP8 Large Language Models
[arxiv'23] Redco: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs
[arxiv'23] DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
[arxiv'23] A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale
[arxiv'23] FLM-101B: An Open LLM and How to Train It with $100K Budget
[arxiv'23] UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming
[arxiv'23] Modeling Parallel Programs using Large Language Models
[arxiv'23] Proteus: Simulating the Performance of Distributed DNN Training
[arxiv'23] Automated Tensor Model Parallelism with Overlapped Communication for Efficient Foundation Model Training
[arxiv'23] Decoupled Model Schedule for Deep Learning Training
[arxiv'23] RAF: Holistic Compilation for Deep Learning Model Training
[arxiv'23] Ada-Grouper: Accelerating Pipeline Parallelism in Preempted Network by Adaptive Group-Scheduling for Micro-Batches
[arxiv'23] Does compressing activations help model parallel training?
[arxiv'23] Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models
[arxiv'23] Scaling Vision Transformers to 22 Billion Parameters
[arxiv'23] Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform
[arxiv'23] TAP: Accelerating Large-Scale DNN Training Through Tensor Automatic Parallelisation
[arxiv'23] SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction
[arxiv'23] ATP: Adaptive Tensor Parallelism for Foundation Models
[IPDPS'23] MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism
[CLUSTER'23] Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models
[NeurIPS'23] ASPEN: Breaking Operator Barriers for Efficient Parallelization of Deep Neural Networks
[NeurIPS'23] DeepPCR: Parallelizing Sequential Operations in Neural Networks
[DAC'23] MixPipe: Efficient Bidirectional Pipeline Parallelism for Training Large-Scale Models
[SC'23] Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency
[SOSP'23] PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation
[SOSP'23] Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates
[TPDS'23] Fold3D: Rethinking and Parallelizing Computational and Communicational Tasks in the Training of Large DNN Models
[HPCA'23] Phloem: Automatic Acceleration of Irregular Applications with Fine-Grain Pipeline Parallelism
[ACL'23] Sequence Parallelism: Long Sequence Training from System Perspective
[CCGrid'23] A Deep Learning Pipeline Parallel Optimization Method
[OSDI'23] MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms
[ATC'23] Accelerating Distributed MoE Training and Inference with Lina
[ATC'23] SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization
[ATC'23] MSRL: Distributed Reinforcement Learning with Dataflow Fragments
[Survey 🔍] [TPDS'23] A Survey on Auto-Parallelism of Large-Scale Deep Learning Training
[ICML'23] SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
[ICML'23] BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models
[ICS'23] A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
[NSDI'23] TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
[NSDI'23] Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
[NSDI'23] ARK: GPU-driven Code Execution for Distributed Deep Learning
[SIGMOD'23] FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement
[MLSys'23] On Optimizing the Communication of Model Parallelism
[MLSys'23] MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
[MLSys'23] Tutel: Adaptive Mixture-of-Experts at Scale
[TPDS'23] Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models
[PPoPP'23] Elastic Averaging for Efficient Pipelined DNN Training
[PPoPP'23] Efficient All-Reduce for Distributed DNN Training in Optical Interconnect Systems
[VLDB'23] MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud
[VLDB'23] Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism
[ASPLOS'23] Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers
[ASPLOS'23] Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression
[arxiv'22] Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
[arxiv'22] Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
[ICPP'22] Tesseract: Parallelize the Tensor Parallelism Efficiently
[MLSys'22] Synthesizing optimal parallelism placement and reduction strategies on hierarchical systems for deep learning
- arxiv
[NeurIPS'22] Fine-tuning Language Models over Slow Networks using Activation Quantization with Guarantees
[SoCC'22] Accelerating Large-Scale Distributed Neural Network Training with SPMD Parallelism
[MLSys'22] Pathways: Asynchronous distributed dataflow for ML
[MLSys'22] SRIFTY: Swift and Thrifty Distributed Neural Network Training on the Cloud
[MLSys'22] Efficient Strong Scaling Through Burst Parallel Training
[EuroSys'22] Varuna: scalable, low-cost training of massive deep learning models
[ATC'22] Whale: Efficient Giant Model Training over Heterogeneous GPUs
[NeurIPS'22] AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness
[PPoPP'22] FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models
[ICML'22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
[ICML'22] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
[HPDC'22] Hare: Exploiting Inter-job and Intra-job Parallelism of Distributed Machine Learning on Heterogeneous GPUs
[OSDI'22] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
[NSDI'22] Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks
[arxiv'21] Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training
[arxiv'21] GSPMD: General and Scalable Parallelization for ML Computation Graphs
[JMLR'21] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
[TPDS'21] TensorOpt: Exploring the Tradeoffs in Distributed DNN Training With Auto-Parallelism
[ATC'21] Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism
[SIGMOD'21] Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce [also in 2.10]
[MLSys'21] PipeMare: Asynchronous Pipeline Parallel DNN Training
[ICLR'21] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
[NeurIPS'21] Piper: Multidimensional Planner for DNN Parallelization
[ICML'21] Memory-Efficient Pipeline-Parallel DNN Training
[ICML'21] TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
[ICML'21] PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
[SC'21] Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
[SC'21] Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (PTD-P or Megatron-LM v2)
[FAST'21] Behemoth: A Flash-centric Training Accelerator for Extreme-scale DNNs
[PPoPP'21] DAPPLE: a pipelined data parallel approach for training large models
[VLDB'21] Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches
[HPCA'20] AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerators
[NeurIPS'20] Efficient Algorithms for Device Placement of DNN Graph Operators
[arxiv'20] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
[KDD'20 Tutorial] DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters
[VLDB'20] PyTorch Distributed: Experiences on Accelerating Data Parallel Training
[OSDI'20] A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (BytePS)
[SOSP'19] PipeDream: Generalized Pipeline Parallelism for DNN Training
[NeurIPS'20] Language Models are Few-Shot Learners [From OpenAI]
- arxiv
[arxiv'20] Scaling Laws for Neural Language Models [From OpenAI]
[HPCA'19] HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array
[IEEE MICRO'19] Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training
[MLSys'19] Beyond data and model parallelism for deep neural networks (FlexFlow)
[MLSys'19] TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
[EuroSys'19] Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks
[EuroSys'19] Supporting Very Large Models using Automatic Dataflow Graph Partitioning (Tofu)
[SOSP'19] A Generic Communication Scheduler for Distributed DNN Training Acceleration
[NeurIPS'19] Mesh-TensorFlow: Deep Learning for Supercomputers
[NeurIPS'19] GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
[ICML'18] Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks
[Survey 🔍] [IJCAI'22] Survey on Efficient Training of Large Neural Networks
[Survey 🔍] [ACM CSUR'19] Demystifying Parallel and Distributed Deep Learning
[Survey 🔍] [ACM CSUR'19] Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques, and Tools

2.7 DL job failures / Fault tolerance (resilient training)

[arxiv'24] SkyServe: Serving AI Models across Regions and Clouds with Spot Instances
[arxiv'24] Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight
[arxiv'24] ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
[arxiv'24] Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
[arxiv'24] PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation
[SOSP'24] ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation
[HPDC'24] DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models
[EuroSys'24] Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures
[NSDI'24] Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
[arxiv'23] Unicron: Economizing Self-Healing LLM Training at Scale
[VLDB'23] Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding
[SOSP'23] GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
[SOSP'23] Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates
[NSDI'23] Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
[EuroSys'22] Varuna: scalable, low-cost training of massive deep learning models
[ATC'22] Sibylla: To Retry or Not To Retry on Deep Learning Job Failure
[MLSys'21] Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery
[FAST'21] CheckFreq: Frequent, Fine-Grained DNN Checkpointing
[ICSE'20] An Empirical Study on Program Failures of Deep Learning Jobs

2.8 AutoML

[OSDI'23] Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters
[NSDI'23] ModelKeeper: Accelerating DNN Training via Automated Training Warmup
[OSDI'20] Retiarii: A Deep Learning Exploratory-Training Framework

2.9 Communication optimization & Network Infrastructure for ML

[SC'24] Optimizing Distributed ML Communication with Fused Computation-Collective Operations
[SC'24] Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI
[NeurIPS'24] LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing
[arxiv'24] LumosCore: Highly Scalable LLM Clusters with Optical Interconnect
[TPDS'24] AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost
[HOTI'24] Unified Collective Communication (UCC): An Unified Library for CPU, GPU, and DPU Collectives
[SC'24] Switch-Less Dragonfly on Wafers: A Scalable Interconnection Architecture based on Wafer-Scale Integration
[HPDC'24] Near-Optimal Wafer-Scale Reduce
[arxiv'24] HiCCL: A Hierarchical Collective Communication Library
[ICS'24] gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters
[ICS'24] Snoopie: A Multi-GPU Communication Profiler and Visualizer
[arxiv'24] CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts
[arxiv'24] Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping
[arxiv'24] Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects
[arxiv'24] Demystifying the Communication Characteristics for Distributed Transformer Models
[ICPP'24] Sparse Gradient Communication with AlltoAll for Accelerating Distributed Deep Learning
[NAIC @ SIGCOMM'24] Proof-of-Concept of a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation
[NAIC @ SIGCOMM'24] Eloquent: A More Robust Transmission Scheme for LLM Token Streaming
[NAIC @ SIGCOMM'24] OmNICCL: Zero-cost Sparse AllReduce with Direct Cache Access and SmartNICs
[SIGCOMM'24] Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem
[SIGCOMM'24] RDMA over Ethernet for Distributed Training at Meta Scale
[SIGCOMM'24] Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs
[SIGCOMM'24] MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud
[arxiv'24] MLTCP: Congestion Control for DNN Training
- [HotNets'24] MLTCP: A Distributed Technique to Approximate Centralized Flow Scheduling For Machine Learning
[arxiv'24] Accelerating Distributed Deep Learning using Lossless Homomorphic Compression
[arxiv'24] ForestColl: Efficient Collective Communications on Heterogeneous Network Fabrics
[APNet'24] Understanding Communication Characteristics of Distributed Training
[ICLR'24] ZeRO++: Extremely Efficient Collective Communication for Large Model Training
[ICLR'24] CO2: Efficient Distributed Training with Full Communication-Computation Overlap
- [arxiv] [openreview]
[MLSys'24] L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and Accurate Deep Learning
[MLSys'24] Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping
[ASPLOS'24] T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
[ASPLOS'24] TCCL: Discovering Better Communication Paths for PCIe GPU Clusters
[ASPLOS'24] Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning
[ASPLOS'24] Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM
[NSDI'24] THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression
[Survey 🔍] [arxiv'23] Communication-Efficient Distributed Deep Learning: A Comprehensive Survey
[arxiv'23] Optimized Network Architectures for Large Language Model Training with Billions of Parameters
[arxiv'23] FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models
[arxiv'23] Rethinking Memory and Communication Cost for Efficient Large Language Model Training
[arxiv'23] Zen: Near-Optimal Sparse Tensor Synchronization for Distributed DNN Training
[arxiv'23] TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Training
[INFOCOM'23] Libra: Contention-Aware GPU Thread Allocation for Data Parallel Training in High Speed Networks
[ICDCS'23] bbTopk: Bandwidth-Aware Sparse Allreduce with Blocked Sparsification for Efficient Distributed Training
[ICML'23] CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks
- Related to DT-FM (NeurIPS'22)
[IPDPS'23] MCR-DL: Mix-and-Match Communication Runtime for Deep Learning
[ASPLOS'23] MSCCLang: Microsoft Collective Communication Language
[ASPLOS'23] Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models
[EuroSys'23] A2TP: Aggregator-aware In-network Aggregation for Multi-tenant Learning
[EuroSys'23] Hi-Speed DNN Training with Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies
[MLSys'23] Cupcake: A Compression Optimizer for Scalable Communication-Efficient Distributed Training
[MLSys'23] On Optimizing the Communication of Model Parallelism
[NSDI'23] TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
[NSDI'23] Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using SYNDICATE
[NSDI'23] TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches
[EuroSys'22] Out-of-order backprop: an effective scheduling technique for deep learning
[ISCA'22] Themis: a network bandwidth-aware collective scheduling policy for distributed training of DL models
[SC'22] HammingMesh: A Network Topology for Large-Scale Deep Learning
[PPoPP'22] Near-optimal sparse allreduce for distributed deep learning
[MLSys'22] Synthesizing optimal parallelism placement and reduction strategies on hierarchical systems for deep learning (P^2)
[ASPLOS'22] Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads (CoCoNET)
[EuroSys'21] DGCL: an efficient communication library for distributed GNN training
[ICLR'21] Multi-Level Local SGD for Heterogeneous Hierarchical Networks
[SIGMOD'21] Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce [also in 2.5]
[SC'21] Flare: flexible in-network allreduce
[NSDI'21] Scaling Distributed Machine Learning with In-Network Aggregation
[ISCA'21] Enabling compute-communication overlap in distributed deep learning training platforms
[PPoPP'21] Synthesizing optimal collective algorithms (SCCL)
[SIGCOMM'21] SiP-ML: High-Bandwidth Optical Network Interconnects for Machine Learning Training
[ISCA'20] An in-network architecture for accelerating shared-memory multiprocessor collectives
[NeurIPS'20] Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning
[PPoPP'20] Taming unbalanced training workloads in deep learning with partial collective operations
[MLSys'20] Blink: Fast and Generic Collectives for Distributed ML
[MLSys'20] PLink: Discovering and Exploiting Datacenter Network Locality for Efficient Cloud-based Distributed Training
[OSDI'20] A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (BytePS)
[MLSys'19] Priority-based Parameter Propagation for Distributed DNN Training (P3)
[MLSys'19] TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
[SOSP'19] A generic communication scheduler for distributed DNN training acceleration (ByteScheduler)
[ATC'17] Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters

2.10 DNN compiler

[SOSP'24] Scaling Deep Learning Computation over the Inter-core Connected Intelligence Processor with T10
[OSDI'23] Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning
[OSDI'23] Welder: Scheduling Deep Learning Memory Access via Tile-graph
[OSDI'23] Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators
[OSDI'23] EINNET: Optimizing Tensor Programs with Derivation-Based Transformations
[OSDI'23] Optimizing Dynamic Neural Networks with Brainstorm
[OSDI'22] ROLLER: Fast and Efficient Tensor Compilation for Deep Learning
[OSDI'20] Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks
[OSDI'20] Ansor: Generating High-Performance Tensor Programs for Deep Learning
[ASPLOS'20] FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System
[OSDI'18] TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

2.11 Model pruning and compression

[ACL'23] Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
[ICLR'23] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
[OSDI'23] AdaEmbed: Adaptive Embedding for Large-Scale Recommendation Models
[ICML'22] TSPipe: Learn from Teacher Faster with Pipelines

2.12 GNN training system

For a comprehensive list of GNN systems papers, refer to https://github.com/chwan1016/awesome-gnn-systems.

[arxiv'24] FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale
[ICPP'24] GNNDrive: Reducing Memory Contention and I/O Congestion for Disk-based GNN Training
[VLDB'24] NeutronStream: A Dynamic GNN Training Framework with Sliding Window for Graph Streams
[arxiv'23] ReFresh: Reducing Memory Access from Exploiting Stable Historical Embeddings for Graph Neural Network Training
[arxiv'23] Helios: An Efficient Out-of-core GNN Training System on Terabyte-scale Graphs with In-memory Performance
[arxiv'23] GNNPipe: Accelerating Distributed Full-Graph GNN Training with Pipelined Model Parallelism
[MLSys'23] Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training
[SIGMOD'23] DUCATI: A Dual-Cache Training System for Graph Neural Networks on Giant Graphs with the GPU
[OSDI'23] MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms
[EuroSys'23] MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks
[KDD'22] Distributed Hybrid CPU and GPU training for Graph Neural Networks on Billion-Scale Heterogeneous Graphs
[VLDB'22] TGL: a general framework for temporal GNN training on billion-scale graphs
[OSDI'21] P3: Distributed Deep Graph Learning at Scale

2.13 Congestion control for DNN training

[arxiv'24] MLTCP: Congestion Control for DNN Training
[HotNets'22] Congestion Control in Machine Learning Clusters

2.14 Others

[ATC'24] Centimani: Enabling Fast AI Accelerator Selection for DNN Training with a Novel Performance Predictor

3. Inference System

[arxiv'24] SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration (Code)
[arxiv'24] SageAttention2 Technical Report: Accurate 4-Bit Attention for Plug-and-play Inference Acceleration (Code)
[arxiv'24] Optimizing Speculative Decoding for Serving Large Language Models Using Goodput
[ACL'24] SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget
[arxiv'24] EcoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving
[IPDPS'24] Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
[arxiv'24] EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
[NeurIPS'24] Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting
[NeurIPS'24] Toward Efficient Inference for Mixture of Experts
[arxiv'24] Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection
[SC'24] PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
[NeurIPS'24 (spotlight)] Sequoia: Scalable and Robust Speculative Decoding
[SC'24] SMIless: Serving DAG-based Inference with Dynamic Invocations under Serverless Computing
[arxiv'24] SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference
[arxiv'24] V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM
[SenSys'24] LiteMoE: Customizing On-device LLM Serving via Proxy Submodel Tuning
[arxiv'24] HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
[arxiv'24] NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
[arxiv'24] SkyServe: Serving AI Models across Regions and Clouds with Spot Instances
[MICRO'24] Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs
[arxiv'24] VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
[arxiv'24] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
[arxiv'24] Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs
[arxiv'24] POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
[arxiv'24] Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
[arxiv'24] MagicPIG: LSH Sampling for Efficient LLM Generation
[arxiv'24] Revisiting SLO and Goodput Metrics in LLM Serving
[arxiv'24] EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models
[arxiv'24] ParallelSpec: Parallel Drafter for Efficient Speculative Decoding
[EuroSys'25] Fast State Restoration in LLM Serving with HCache
[arxiv'24] SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
[arxiv'24] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
[arxiv'24] vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
[arxiv'24] Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
[arxiv'24] CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts
[arxiv'24] DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
[HPCA'24] KRISP: Enabling Kernel-wise RIght-sizing for Spatial Partitioned GPU Inference Servers
[arxiv'24] Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference
[arxiv'24] Efficient LLM Scheduling by Learning to Rank
[arxiv'24] P/D-Serve: Serving Disaggregated Large Language Model at Scale
[arxiv'24] NanoFlow: Towards Optimal Large Language Model Serving Throughput
[arxiv'24] MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
[SOSP'24] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
[SOSP'24] LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
[SOSP'24] Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation
[SOSP'24] Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
[arxiv'24] LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale
[ICPP'24] GMM: An Efficient GPU Memory Management-based Model Serving System for Multiple DNN Inference Models
[SIGCOMM'24] CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
[ES-FoMO @ ICML'24] CO2: Precise Attention Score Observation for improving KV Cache Replacement in Large Language Models
[OSDI'24] dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving
[OSDI'24] Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
[OSDI'24] USHER: Holistic Interference Avoidance for Resource Optimized ML Inference
[OSDI'24] Fairness in Serving Large Language Models
[OSDI'24] MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures
[OSDI'24] Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
[OSDI'24] ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
[OSDI'24] InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
[OSDI'24] Llumnix: Dynamic Scheduling for Large Language Model Serving
[OSDI'24] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
[ATC'24] Power-aware Deep Learning Model Serving with μ-Serve
[ATC'24] Fast Inference for Probabilistic Graphical Models
[ATC'24] Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention
[ATC'24] PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch
[ATC'24] Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs
[TPDS'24] ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG
[Survey 🔍] [arxiv'24] LLM Inference Serving: Survey of Recent Advances and Opportunities
[arxiv'24] Metron: Holistic Performance Evaluation Framework for LLM Inference Systems
[arxiv'24] Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead
[arxiv'24] One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving
[arxiv'24] MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
[ISCA'24] ElasticRec: A Microservice-based Model Serving Architecture Enabling Elastic Resource Scaling for Recommendation Models
[ISCA'24] Splitwise: Efficient generative LLM inference using phase splitting
[ICML'24] HexGen: Generative Inference of Large Language Model over Heterogeneous Environment
[ICML'24] MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving
[HPCA'24] An LPDDR-based CXL-PNM Platform for TCO-efficient Inference of Transformer-based Large Language Models
[arxiv'24] Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs
[MobiSys'24] ARISE: High-Capacity AR Offloading Inference Serving via Proactive Scheduling
[MobiSys'24] Pantheon: Preemptible Multi-DNN Inference on Mobile Edge GPUs
[arxiv'24] Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference
[arxiv'24] HawkVision: Low-Latency Modeless Edge AI Serving
[MLSys'24] HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices
[MLSys'24] S-LoRA: Serving Thousands of Concurrent LoRA Adapters
[MLSys'24] Vidur: A Large-Scale Simulation Framework For LLM Inference
[arxiv'24] The CAP Principle for LLM Serving
[WWW'24] λGrapher: A Resource-Efficient Serverless System for GNN Serving through Graph Sharing
[ICML'24] CLLMs: Consistency Large Language Models
[arxiv'24] BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
[EuroSys'24] Model Selection for Latency-Critical Inference Serving
[arxiv'24] Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity
[arxiv'24] Learn To be Efficient: Build Structured Sparsity in Large Language Models
[arxiv'24] Sponge: Inference Serving with Dynamic SLOs Using In-Place Vertical Scaling
[ISCA'24] Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
[arxiv'24] Minions: Accelerating Large Language Model Inference with Adaptive and Collective Speculative Decoding
[arxiv'24] ALTO: An Efficient Network Orchestrator for Compound AI Systems
[ASPLOS'24] ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference
[ASPLOS'24] NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing
[arxiv'24] ATP: Enabling Fast LLM Serving via Attention on Top Principal Keys
[arxiv'24] Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
[ICML'24] DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
[ICLR'24] Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
[arxiv'24] FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning
[arxiv'24] Wisdom of Committee: Distilling from Foundation Model to Specialized Application Model
[arxiv'24] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
[arxiv'24] LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
- [PPoPP'24 poster] POSTER: LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
[NSDI'24] Approximate Caching for Efficiently Serving Diffusion Models
[arxiv'24] APIServe: Efficient API Support for Large-Language Model Inferencing
[arxiv'24] ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models
[arxiv'24] MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving
[arxiv'24] FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
[arxiv'24] Accelerating Retrieval-Augmented Language Model Serving with Speculation
[arxiv'24] CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
[arxiv'24] Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
[arxiv'24] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
[Survey 🔍] [arxiv'24] Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
[arxiv'24] Learned Best-Effort LLM Serving
[arxiv'24] Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
[VLDB'24] Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
[ASPLOS'24] SpotServe: Serving Generative Large Language Models on Preemptible Instances
[ASPLOS'24] SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification
[arxiv'23] DeltaZip: Multi-Tenant Language Model Serving via Delta Compression
[EMNLP'23] Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding
[arxiv'23] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding
[arxiv'23] Fairness in Serving Large Language Models
[arxiv'23] Moirai: Towards Optimal Placement for Distributed Inference on Heterogeneous Devices
[arxiv'23] Punica: Multi-Tenant LoRA Serving
[arxiv'23] Pipeline Parallelism for DNN Inference with Practical Performance Guarantees
[arxiv'23] SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
[arxiv'23] High-throughput Generative Inference of Large Language Models with a Single GPU
[HPDC'23] Kairos: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources
[SOSP'23] Paella: Low-latency Model Serving with Virtualized GPU Scheduling
[SOSP'23] Efficient Memory Management for Large Language Model Serving with PagedAttention
[MLSys'23] Efficiently Scaling Transformer Inference
[EuroSys'23] Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access
[EuroSys'23] Tabi: An Efficient Multi-Level Inference System for Large Language Models
[EuroSys'23] Pocket: ML Serving from the Edge
[OSDI'23] AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
[NSDI'23] SHEPHERD: Serving DNNs in the Wild
[VLDB'23] Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures
[ICML'23] Fast Inference from Transformers via Speculative Decoding
[SIGMOD'22] Serverless Data Science - Are We There Yet? A Case Study of Model Serving
[OSDI'22] Orca: A Distributed Serving System for Transformer-Based Generative Models
[OSDI'22] Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences
[ATC'22] SOTER: Guarding Black-box Inference for General Neural Networks at the Edge
[ATC'22] Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing
[ATC'22] Tetris: Memory-efficient Serverless Inference through Tensor Sharing
[ATC'22] PetS: A Unified Framework for Parameter-Efficient Transformers Serving
[ATC'21] INFaaS: Automated Model-less Inference Serving
[SoCC'21] Morphling: Fast, Near-Optimal Auto-Configuration for Cloud-Native Model Serving
[arxiv'21] Supporting Massive DLRM Inference through Software Defined Memory
[MobiCom'20] SPINN: Synergistic Progressive Inference of Neural Networks over Device and Cloud

4. Mixture of Experts (MoE)

This is the list of papers about MoE training and inference (collected from 2.6 and 3).

[ML for Sys workshop @ NeurIPS'24] IFMoE: An Inference Framework Design for Fine-grained MoE
[ML for Sys workshop @ NeurIPS'24] TurboMoE: Enhancing MoE Model Training with Smart Kernel-Fusion and Data Transformation
[arxiv'24] Dense Backpropagation Improves Routing for Sparsely-Gated Mixture-of-Experts
[arxiv'24] MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
[MLSys'24] SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
[arxiv'24] Pro-Prophet: Systematic Load Balancing Method for Efficient Parallel Training of Large-scale MoE Models
[EMNLP'24] Mixture of Diverse Size Experts
[ACL'24] SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget
[SoCC'24] MoEsaic: Shared Mixture of Experts
[KDD'24] Efficient Mixture of Experts based on Large Language Models for Low-Resource Data Preprocessing
[arxiv'24] Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism
[IPDPS'24] Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
[arxiv'24] EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
[arxiv'24] Shortcut-connected Expert Parallelism for Accelerating Mixture of Experts
[NeurIPS'24] Toward Efficient Inference for Mixture of Experts
[arxiv'24] Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection
[MLSys'24] SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
[SC'24] APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes
[NeurIPS'24] GraphMETRO: Mitigating Complex Graph Distribution Shifts via Mixture of Aligned Experts
[arxiv'24] HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
[arxiv'24] Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
[NeurIPS'24] LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing
[arxiv'24] Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
[NeurIPS'24] Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design
[arxiv'24] ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference
[arxiv'24] Demystifying the Compression of Mixture-of-Experts Through a Unified Framework
[arxiv'24] Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
[arxiv'24] Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling
[arxiv'24] MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router
[arxiv'24] Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models
[arxiv'24] MoH: Multi-Head Attention as Mixture-of-Head Attention
[arxiv'24] AT-MoE: Adaptive Task-planning Mixture of Experts via LoRA Approach
[NeurIPS'24 (spotlight)] Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts
[arxiv'24] Aria: An Open Multimodal Native Mixture-of-Experts Model
[arxiv'24] MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More
[arxiv'24] MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts
[arxiv'24] Upcycling Large Language Models into Mixture of Experts
[arxiv'24] No Need to Talk: Asynchronous Mixture of Language Models
[arxiv'24] Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
[arxiv'24] HMoE: Heterogeneous Mixture of Experts for Language Modeling
[arxiv'24] FedMoE: Personalized Federated Learning via Heterogeneous Mixture of Experts
[arxiv'24] AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies
[arxiv'24] Layerwise Recurrent Router for Mixture-of-Experts
[arxiv'24] Partial Experts Checkpoint: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training
[SRW @ ACL'24] MoExtend: Tuning New Experts for Modality and Task Extension
[arxiv'24] MoDE: Effective Multi-task Parameter Efficient Fine-Tuning with a Mixture of Dyadic Experts
[arxiv'24] Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs
[arxiv'24] Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts
[arxiv'24] Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
[ICML'24] Scaling Laws for Fine-Grained Mixture of Experts
[ICML'24] Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training
[MLSys'24] QMoE: Sub-1-Bit Compression of Trillion-Parameter Models
[MLSys'24] Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping
[arxiv'24] CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
[arxiv'24] AdaMoLE: Fine-Tuning Large Language Models with Adaptive Mixture of Low-Rank Adaptation Experts
[SIGIR'24] M3oE: Multi-Domain Multi-Task Mixture-of Experts Recommendation Framework
[EuroSys'24] ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling
[arxiv'24] MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA based Mixture of Experts
[ICLR'24] Mixture of LoRA Experts
[arxiv'24] Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
[arxiv'24] MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving
[IJCAI'24] LocMoE: A Low-overhead MoE for Large Language Model Training
[ISCA'24] Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
[EMNLP'23] Adaptive Gating in Mixture-of-Experts based Language Models
[ACL'23] AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation
[arxiv'23] Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
[ATC'23] Accelerating Distributed MoE Training and Inference with Lina
[ATC'23] SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization
[SIGMOD'23] FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement
[ICS'23] A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
[MLSys'23] MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
[MLSys'23] Tutel: Adaptive Mixture-of-Experts at Scale
[PPoPP'22] FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models
[SustaiNLP @ EMNLP'22] Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production
[NeurIPS'22] Mixture-of-Experts with Expert Choice Routing
[ICML'22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
[ICML'22] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
[JMLR'22] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
[ICLR'17] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

5. LLM Long Context

[SOSP'24] LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
[arxiv'24] USP: A Unified Sequence Parallelism Approach for Long Context Generative AI
[arxiv'24] Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
[NeurIPS'24 Workshop] Long Context RAG Performance of Large Language Models
[arxiv'24] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
[arxiv'24] Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
[arxiv'24] CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts
[COLM'24] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
[arxiv'24] FocusLLM: Scaling LLM's Context by Parallel Decoding
[Survey 🔍] [IJCAI'24] X-former Elucidator: Reviving Efficient Attention for Long Context Language Modeling

6. Federated Learning

[arxiv'24] FedMoE: Personalized Federated Learning via Heterogeneous Mixture of Experts
[MLSys'24] LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning
[arxiv'24] FedEx: Expediting Federated Learning over Heterogeneous Mobile Devices by Overlapping and Participant Selection
[KDD'24] FedBiOT: LLM Local Fine-tuning in Federated Learning without Full Model
[CCGrid'24] Apodotiko: Enabling Efficient Serverless Federated Learning in Heterogeneous Environments
[EuroSys'24] Dordis: Efficient Federated Learning with Dropout-Resilient Differential Privacy
[arxiv'24] Decoupled Vertical Federated Learning for Practical Training on Vertically Partitioned Data
[SAC'24] Training Heterogeneous Client Models using Knowledge Distillation in Serverless Federated Learning
[arxiv'23] CAFE: Carbon-Aware Federated Learning in Geographically Distributed Data Centers
[arxiv'23] Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization
[IMWUT'23] AttFL: A Personalized Federated Learning Framework for Time-series Mobile and Embedded Sensor Data Processing
[Survey 🔍] [FGCS'23] Model aggregation techniques in federated learning: A comprehensive survey
[SoCC'23] Auxo: Heterogeneity-Mitigating Federated Learning via Scalable Client Clustering
[MLSys'23] GlueFL: Reconciling Client Sampling and Model Masking for Bandwidth Efficient Federated Learning
[WWW'23] To Store or Not? Online Data Selection for Federated Learning with Limited Storage
[EuroSys'23] REFL: Resource-Efficient Federated Learning
[VLDB'23] FederatedScope: A Flexible Federated Learning Platform for Heterogeneity
[RecSys'22] Towards Fair Federated Recommendation Learning: Characterizing the Inter-Dependence of System and Data Heterogeneity
[TMLR'22] Optimal Client Sampling for Federated Learning
[ICML'22] FedScale: Benchmarking Model and System Performance of Federated Learning at Scale
[MobiSys'22] FedBalancer: data and pace control for efficient federated learning on heterogeneous clients
[MobiCom'22] PyramidFL: A Fine-grained Client Selection Framework for Efficient Federated Learning
[MLSys'22] PAPAYA: Practical, Private, and Scalable Federated Learning
[AISTATS'22] Federated Learning with Buffered Asynchronous Aggregation
[NeurIPS'21] Federated Reconstruction: Partially Local Federated Learning
[NeurIPS'21] FjORD: Fair and Accurate Federated Learning under heterogeneous targets with Ordered Dropout
[OSDI'21] Oort: Efficient Federated Learning via Guided Participant Selection
[MICRO'21] AutoFL: Enabling Heterogeneity-Aware Energy Efficient Federated Learning
[MLSys'19] Towards Federated Learning at Scale: System Design
[Survey 🔍] [ACM CSUR'22] Federated Learning for Smart Healthcare: A Survey

7. Privacy-Preserving ML

[DAC'23] Privacy-Preserving DNN Training with Prefetched Meta-Keys on Heterogeneous Neural Network Accelerators
[ICLR'23] MPCFormer: fast, performant and private Transformer inference with MPC
[NeurIPS'22] Iron: Private Inference on Transformers

8. ML APIs & Application-side Optimization

[arxiv'24] APIServe: Efficient API Support for Large-Language Model Inferencing
[OSDI'24] ChameleonAPI: Automatic and Efficient Customization of Neural Networks for ML Applications
[ICML'22] Efficient Online ML API Selection for Multi-Label Classification Tasks (FrugalMCT)
[NeurIPS'20] FrugalML: How to use ML Prediction APIs more accurately and cheaply

9. ML (LLM) for Systems

[ICSE'25] Large Language Models as Configuration Validators
[NeurIPS'24] IaC-Eval: A code generation benchmark for Infrastructure-as-Code programs
[arxiv'24] Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight
[arxiv'24] LLMTune: Accelerate Database Knob Tuning with Large Language Models
[SIGCOMM'24] NetLLM: Adapting Large Language Models for Networking
[arxiv'24] LLM-Enhanced Data Management
[arxiv'24] MPIrigen: MPI Code Generation through Domain-Specific Language Models
[arxiv'24] Can Large Language Models Write Parallel Code?
[arxiv'23] LLM-Assisted Code Cleaning For Training Accurate Code Generators
[arxiv'23] Large Language Models for Compiler Optimization
[VLDB'23] How Large Language Models Will Disrupt Data Management

10. GPU kernel scheduling

[arxiv'24] ACS: Concurrent Kernel Execution on Irregular, Input-Dependent Computational Graphs
[RTAS'24] Demystifying NVIDIA GPU Internals to Enable Reliable GPU Management
- slides: link
[OSDI'23] Welder: Scheduling Deep Learning Memory Access via Tile-graph
[arxiv'21] Characterizing Concurrency Mechanisms for NVIDIA GPUs under Deep Learning Workloads
[SIGMETRICS'21] Demystifying the Placement Policies of the NVIDIA GPU Thread Block Scheduler for Concurrent Kernels
[NeurIPS'20] Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning
[RTSS'17] GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed

11. Energy-efficiency for LLM (carbon-aware)

[arxiv'24] DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
[SOSP'24] Perseus: Removing Energy Bloat from Large Model Training
[arxiv'23] CAFE: Carbon-Aware Federated Learning in Geographically Distributed Data Centers
[ATC'23] EnvPipe: Performance-preserving DNN Training Framework for Saving Energy
[NSDI'23] Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training

Others

[CPAL'24 (PMLR)] Jaxpruner: A Concise Library for Sparsity Research
[arxiv'24] Scorch: A Library for Sparse Deep Learning
[arxiv'24] Drowning in Documents: Consequences of Scaling Reranker Inference
[arxiv'24] Crafting Interpretable Embeddings for Language Neuroscience by Asking LLMs Questions
[arxiv'24] Computational Bottlenecks of Training Small-scale Large Language Models
[Survey 🔍] [arxiv'24] A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness
[arxiv'24] AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution
[ASPLOS'25 (to appear)] PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption
[NeurIPS'24] Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems
[NeurIPS'24 Workshop] Long Context RAG Performance of Large Language Models
[arxiv'24] Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment
[arxiv'24] DroidSpeak: Enhancing Cross-LLM Communication
[arxiv'24] Disaggregating Embedding Recommendation Systems with FlexEMR
[arxiv'24] JudgeBench: A Benchmark for Evaluating LLM-based Judges
[VLDB'25] Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models
[arxiv'24] You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation
[arxiv'24] Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native
[Survey 🔍] [arxiv'24] A Survey of Resource-efficient LLM and Multimodal Foundation Models
[arxiv'23] Efficiently Programming Large Language Models using SGLang
[MICRO'23] Path Forward Beyond Simulators: Fast and Accurate GPU Execution Time Prediction for DNN Workloads

References

This repository is motivated by:

https://github.com/HuaizhengZhang/Awesome-System-for-Machine-Learning
https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers
https://github.com/ganler/ResearchReading
https://jeongseob.github.io/readings_mlsys.html
https://github.com/chwan1016/awesome-gnn-systems
https://github.com/ConnollyLeon/awesome-Auto-Parallelism
