Paper list for broad topics in machine learning systems
NOTE: Survey papers are annotated with the [Survey 🔍] prefix.
- Paper List for Machine Learning Systems
- Table of Contents
- 1. Data Processing
- 2. Training System
- 2.1 Empirical study on ML Jobs
- 2.2 DNN job scheduling
- 2.3 GPU sharing
- 2.4 GPU memory management and optimization
- 2.5 GPU memory usage estimate
- 2.6 Distributed training (Parallelism)
- 2.7 DL job failures / Fault tolerance (resilient training)
- 2.8 AutoML
- 2.9 Communication optimization & Network Infrastructure for ML
- 2.10 DNN compiler
- 2.11 Model pruning and compression
- 2.12 GNN training system
- 2.13 Congestion control for DNN training
- 2.14 Others
- 3. Inference System
- 4. Mixture of Experts (MoE)
- 5. LLM Long Context
- 6. Federated Learning
- 7. Privacy-Preserving ML
- 8. ML APIs & Application-side Optimization
- 9. ML (LLM) for Systems
- 10. GPU kernel scheduling
- 11. Energy-efficiency for LLM (carbon-aware)
- Others
- References
- [HotInfra'24] Lotus: Characterize Architecture Level CPU-based Preprocessing in Machine Learning Pipelines
- [arxiv'24] TensorSocket: Shared Data Loading for Deep Learning Training
- [arxiv'24] Efficient Tabular Data Preprocessing of ML Pipelines
- [arxiv'24] cedar: Composable and Optimized Machine Learning Input Data Pipelines
- [MLSys'22] Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines
- [ISCA'22] Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
- [SIGMOD'22] Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines
- [VLDB'21] Analyzing and Mitigating Data Stalls in DNN Training
- [VLDB'21] tf.data: A Machine Learning Data Processing Framework
- [ATC'24] Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement
- [HotStorage'24] A Selective Preprocessing Offloading Framework for Reducing Data Traffic in DL Training
- [VLDB'24] FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation
- [arxiv'23] Rinas: Training with Dataset Shuffling Can Be General and Fast
- [CVPR'23] FFCV: Accelerating Training by Removing Data Bottlenecks
- [RecSys'23] InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models
- [SIGMOD'23] GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning
- [VLDB'23] FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline
- [SoCC'23] tf.data service: A Case for Disaggregating ML Input Data Processing
- [ATC'22] Cachew: Machine Learning Input Data Processing as a Service
- [OSDI'22] Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters
- [ICPP'19] DLBooster: Boosting End-to-End Deep Learning Workflows with Offloading Data Preprocessing Pipelines
- [TACO'23] Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training
- [ICPP'22] Lobster: Load Balance-Aware I/O for Distributed DNN Training
- [SC'21] Clairvoyant Prefetching for Distributed Machine Learning I/O
- [VLDB'25] Eliminating Data Processing Bottlenecks in GNN Training over Large Graphs via Two-level Feature Compression
- [ISCA'24] PreSto: An In-Storage Data Preprocessing System for Training Recommendation Models
- [arxiv'23] Towards Data-centric Graph Machine Learning: Review and Outlook
- [arxiv'23] FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models
- [MLSys'23] RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure
- [ASPLOS'22] RecShard: statistical feature-based memory optimization for industry-scale neural recommendation
- [RecSys'23] InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models
- [arxiv'23] MTrainS: Improving DLRM training efficiency using heterogeneous memories
- [SOSP'23] Bagpipe: Accelerating Deep Recommendation Model Training
- [SOSP'23] gSampler: General and Efficient GPU-based Graph Sampling for Graph Learning
- [NSDI'23] BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing
- [DAC'22] A Joint Management Middleware to Improve Training Performance of Deep Recommendation Systems with SSDs
- [VLDB'22] Accelerating Recommendation System Training by Leveraging Popular Choices
- [TPDS'23] High-Level Data Abstraction and Elastic Data Caching for Data-Intensive AI Applications on Cloud-Native Platforms
- [SOSP'23] UGACHE: A Unified GPU Cache for Embedding-based Deep Learning
- [ATC'23] Tectonic-Shift: A Composite Storage Fabric for Large-Scale ML Training
- [EuroSys'23] SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters [also in 2.1]
- [FAST'23] SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training
- [HPCA'23] iCACHE: An Importance-Sampling-Informed Cache for Accelerating I/O-Bound DNN Model Training
- [NeurIPS'22] A Deep Learning Dataloader with Shared Data Preparation
- [CLUSTER'22] Hvac: Removing I/O Bottleneck for Large-Scale Deep Learning Applications
- [ICDE'22] Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs
- [ATC'21] Refurbish Your Training Data: Reusing Partially Augmented Samples for Faster Deep Neural Network Training
- [FAST'20] Quiver: An Informed Storage Cache for Deep Learning
- [ICPP'20] DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training
- [arxiv'19] Faster Neural Network Training with Data Echoing
- [HotCloud'19] The Case for Unifying Data Loading in Machine Learning Clusters
- [ECCV'22] L3: Accelerator-Friendly Lossless Image Format for High-Resolution, High-Throughput DNN Training
- [VLDB'21] Progressive compressed records: Taking a byte out of deep learning data
- [CIDR'21] Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines
- [VLDB'18] Snorkel: Rapid Training Data Creation with Weak Supervision
- [ICSE'24] An Empirical Study on Low GPU Utilization of Deep Learning Jobs
- [NSDI'24] Characterization of Large Language Model Development in the Datacenter
- [NSDI'22] MLaaS in the wild: workload analysis and scheduling in large-scale heterogeneous GPU clusters (PAI)
- [ATC'19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (Philly)
- [SoCC'24] Kale: Elastic GPU Scheduling for Online DL Model Training
- [arxiv'24] Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
- [SC'24] PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters
- [OSDI'24] MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale
- [ASPLOS'24] Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters
- [Middleware'24] Optimal Resource Efficiency with Fairness in Heterogeneous GPU Clusters
- [IPDPS'24] Hadar: Heterogeneity-Aware Optimization-Based Online Scheduling for Deep Learning Cluster
- [EuroSys'24] Blox: A Modular Toolkit for Deep Learning Schedulers
- [NSDI'24] Swing: Short-cutting Rings for Higher Bandwidth Allreduce
- [NSDI'24] Towards Domain-Specific Network Transport for Distributed DNN Training
- [NSDI'24] Vulcan: Automatic Query Planning for Live ML Analytics
- [NSDI'24] CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters
- [Survey 🔍] [ACM CSUR'23] Deep Learning Workload Scheduling in GPU Datacenters: A Survey
- [arxiv'23] Energy-Efficient GPU Clusters Scheduling for Deep Learning
- [SC'23] EasyScale: Accuracy-consistent Elastic Training for Deep Learning
- [ICPP'23] CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel
- [ICPP'23] Embracing Uncertainty for Equity in Resource Allocation in ML Training
- [SOSP'23] Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling
- [NSDI'23] Shockwave: Proactive, Fair, and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning
- [EuroSys'23] SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters [also in 1.2]
- [EuroSys'23] Lyra: Elastic Scheduling for Deep Learning Clusters
- [EuroSys'23] ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning
- [ASPLOS'23] Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs
- [arxiv'22] Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads
- [Survey 🔍] [arxiv'22] Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision
- [SoCC'22] ESCHER: Expressive Scheduling with Ephemeral Resources
- [NSDI'22] MLaaS in the wild: workload analysis and scheduling in large-scale heterogeneous GPU clusters (PAI)
- [OSDI'22] Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters (Synergy)
- [SIGCOMM'22] Multi-resource interleaving for deep learning training (Muri)
- [MLSys'21] Wavelet: Efficient DNN Training with Tick-Tock Scheduling
- [SoCC'21] Chronus: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs
- [SC'21] Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters (Helios)
- [OSDI'21] Privacy Budget Scheduling (DPF)
- [NSDI'21] Elastic Resource Sharing for Distributed Deep Learning (AFS)
- [OSDI'21] Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
- [EuroSys'20] Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning (GandivaFair)
- [NSDI'20] Themis: Fair and Efficient GPU Cluster Scheduling
- [OSDI'20] HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees
- [OSDI'20] Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads (Gavel)
- [EuroSys'20] AlloX: Compute Allocation in Hybrid Clusters
- [MLSys'20] Resource Elasticity in Distributed Deep Learning
- [NSDI'19] Tiresias: A GPU Cluster Manager for Distributed Deep Learning
- [ATC'19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (Philly)
- [EuroSys'18] Optimus: an efficient dynamic resource scheduler for deep learning clusters
- [OSDI'18] Gandiva: Introspective Cluster Scheduling for Deep Learning
- [SC'24] ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments
- [arxiv'24] Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads
- [arxiv'24] Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference
- [ICPP'24] MIGER: Integrating Multi-Instance GPU and Multi-Process Service for Deep Learning Clusters
- [ASPLOS'24] RAP: Resource-aware Automated GPU Sharing for Multi-GPU Recommendation Model Training and Input Preprocessing
- [EuroSys'24] Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications
- [ATC'23] Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent
- [NSDI'23] Transparent GPU Sharing in Container Clouds for Deep Learning Workloads
- [ICPP'23] FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference
- [arxiv'23] GACER: Granularity-Aware ConcurrEncy Regulation for Multi-Tenant Deep Learning
- [arxiv'23] MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters
- [SoCC'22] MISO: exploiting multi-instance GPU capability on multi-tenant GPU clusters
- [PACT'22] GPUPool: A Holistic Approach to Fine-Grained GPU Sharing in the Cloud
- [ATC'21] Zico: Efficient GPU Memory Sharing for Concurrent DNN Training
- [MLSys'20] Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications
- [OSDI'20] AntMan: Dynamic Scaling on GPU Clusters for Deep Learning
- [OSDI'20] PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications
- [RTAS'19] Fractional GPUs: Software-Based Compute and Memory Bandwidth Reservation for GPUs
- [arxiv'24] Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator
- [TACO'24] ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management
- [arxiv'24] GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
- [ASPLOS'24] GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching
- [arxiv'23] Rethinking Memory and Communication Cost for Efficient Large Language Model Training
- [arxiv'23] Quantized Distributed Training of Large Models with Convergence Guarantees (QSDP)
- [arxiv'23] Does compressing activations help model parallel training?
- [SoCC'23] Towards GPU Memory Efficiency for Distributed Training at Scale
- [VLDB'23] PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
- [SOSP'23] Efficient Memory Management for Large Language Model Serving with PagedAttention
- [HPCA'23] MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers via Memory-Saving Inter-Operator Parallelism
- [HPCA'23] Tensor Movement Orchestration in Multi-GPU Training Systems
- [IJCAI'23] OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning
- [ICLR'22] LoRA: Low-Rank Adaptation of Large Language Models
  - algorithmic method for memory efficiency
- [VLDB'22] Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers
- [ATC'21] ZeRO-Offload: Democratizing Billion-Scale Model Training
- [ICLR'21] ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training
- [ICLR'21] Dynamic Tensor Rematerialization
- [SC'21] ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning
- [HPCA'21] Sentinel: Efficient Tensor Migration and Allocation on Heterogeneous Memory Systems for Deep Learning
- [MLSys'20] Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization
- [ASPLOS'20] Capuchin: Tensor-based GPU Memory Management for Deep Learning
- [ASPLOS'20] SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping
- [SC'20] ZeRO: memory optimizations toward training trillion parameter models
- [ISCA'18] Gist: Efficient Data Encoding for Deep Neural Network Training
- [PPoPP'18] Superneurons: dynamic GPU memory management for training deep neural networks
- [MICRO'16] vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design
- [arxiv'16] Training Deep Nets with Sublinear Memory Cost
- [ESEC/FSE'20] Estimating GPU memory consumption of deep learning models
- [ASPLOS'25 (to appear)] GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism
- [SOSP'24] Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor
- [SOSP'24] Enabling Parallelism Hot Switching for Efficient Training of Large Language Models
- [arxiv'24] Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator
- [TACO'24] ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management
- [NeurIPS'24] Rethinking Memory and Communication Costs for Efficient Data Parallel Training of Large Language Models
- [NeurIPS'24] SpeedLoader: An I/O efficient scheme for heterogeneous and distributed LLM operation
- [SC'24] Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching
- [SC'24] Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers
- [arxiv'24] BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training
- [arxiv'24] Cephalo: Harnessing Heterogeneous GPU Clusters for Training Transformer Models
- [SoCC'24] Distributed training of large language models on AWS Trainium
- [arxiv'24] SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile
- [TPDS'24] AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost
- [arxiv'24] FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression
- [arxiv'24] FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training
- [arxiv'24] TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training
- [arxiv'24] PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training
- [arxiv'24] Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters
- [SOSP'24] TENPLEX: Changing Resources of Deep Learning Jobs using Parallelizable Tensor Collections
- [arxiv'24] Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management
- [arxiv'24] FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment
- [arxiv'24] PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation
- [arxiv'24] Unicron: Economizing Self-Healing LLM Training at Scale
- [arxiv'24] TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading
- [ICPP'24] AutoPipe: Automatic Configuration of Pipeline Parallelism in Shared GPU Cluster
- [arxiv'24] Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
- [Survey 🔍] [arxiv'24] Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
- [COLM'24] LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers
- [OSDI'24] nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training
- [ATC'24] Metis: Fast Automatic Distributed Training on Heterogeneous GPUs
- [ATC'24] FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences
- [ATC'24] OPER: Optimality-Guided Embedding Table Parallelization for Large-scale Recommendation Model
- [arxiv'24] LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism
- [arxiv'24] PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning
- [HPDC'24] DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models
- [ICML'24] Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training
- [ICML'24] Integrated Hardware Architecture and Device Placement Search
- [MobiCom'24] Asteroid: Resource-Efficient Hybrid Pipeline Parallelism for Collaborative DNN Training on Heterogeneous Edge Devices
- [EuroSys'24] ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling
- [EuroMLSys@EuroSys'24] ML Training with Cloud GPU Shortages: Is Cross-Region the Answer?
- [ASPLOS'24] AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning
- [ASPLOS'24] PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training
- [EuroSys'24] Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation
- [arxiv'24] BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences
- [arxiv'24] Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
- [arxiv'24] Accelerating Heterogeneous Tensor Parallelism via Flexible Workload Control
- [arxiv'24] GRAWA: Gradient-based Weighted Averaging for Distributed Training of Deep Learning Models
- [arxiv'24] BitDelta: Your Fine-Tune May Only Be Worth One Bit
- [arxiv'24] NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models
- [arxiv'24] Accelerating Parallel Sampling of Diffusion Models
- [arxiv'24] Training DNN Models over Heterogeneous Clusters with Optimal Performance
- [TKDE'24] Improving Automatic Parallel Training via Balanced Memory Workload Optimization
  - extended version of Galvatron (VLDB'23)
  - arxiv version (2023): link
- [NSDI'24] DISTMM: Accelerating Distributed Multi-modal Model Training
- [NSDI'24] Accelerating Neural Recommendation Training with Embedding Scheduling
- [NSDI'24] Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer
- [NSDI'24] QuickUpdate: a Real-Time Personalization System for Large-Scale Recommendation Models
- [NSDI'24] Scaling Large Language Model Training to More Than 10,000 GPUs
- [arxiv'24] Breaking MLPerf Training: A Case Study on Optimizing BERT
- [ICLR'24] CO2: Efficient Distributed Training with Full Communication-Computation Overlap
- [arxiv'24] LocMoE: A Low-overhead MoE for Large Language Model Training
- [arxiv'24] Re-evaluating the Memory-balanced Pipeline Parallelism: BPipe
- [AAMAS'24] Holonic Learning: A Flexible Agent-based Distributed Machine Learning Framework
- [VLDB'24] Saturn: An Optimized Data System for Multi-Large-Model Deep Learning Workloads
- [HPCA'24] Tessel: Boosting Distributed Execution of Large DNN Models via Flexible Schedule Search
- [NSDI'24] Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
- [EuroSys'24] HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis
- [ICPP'23] Mercury: Fast and Optimal Device Placement for Large Deep Learning Models
- [arxiv'23] ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a Single GPU
- [arxiv'23] FlexModel: A Framework for Interpretability of Distributed Large Language Models
- [arxiv'23] Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
- [arxiv'23] RTP: Rethinking Tensor Parallelism with Memory Deduplication
- [arxiv'23] FP8-LM: Training FP8 Large Language Models
- [arxiv'23] Redco: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs
- [arxiv'23] FLM-101B: An Open LLM and How to Train It with $100K Budget
- [arxiv'23] UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming
- [arxiv'23] Modeling Parallel Programs using Large Language Models
- [arxiv'23] Proteus: Simulating the Performance of Distributed DNN Training
- [arxiv'23] Automated Tensor Model Parallelism with Overlapped Communication for Efficient Foundation Model Training
- [arxiv'23] Decoupled Model Schedule for Deep Learning Training
- [arxiv'23] RAF: Holistic Compilation for Deep Learning Model Training
- [arxiv'23] Ada-Grouper: Accelerating Pipeline Parallelism in Preempted Network by Adaptive Group-Scheduling for Micro-Batches
- [arxiv'23] Does compressing activations help model parallel training?
- [arxiv'23] Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models
- [arxiv'23] Scaling Vision Transformers to 22 Billion Parameters
- [arxiv'23] Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform
- [arxiv'23] TAP: Accelerating Large-Scale DNN Training Through Tensor Automatic Parallelisation
- [arxiv'23] SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction
- [arxiv'23] ATP: Adaptive Tensor Parallelism for Foundation Models
- [IPDPS'23] MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism
- [CLUSTER'23] Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models
- [NeurIPS'23] ASPEN: Breaking Operator Barriers for Efficient Parallelization of Deep Neural Networks
- [NeurIPS'23] DeepPCR: Parallelizing Sequential Operations in Neural Networks
- [DAC'23] MixPipe: Efficient Bidirectional Pipeline Parallelism for Training Large-Scale Models
- [SC'23] Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency
- [SOSP'23] PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation
- [SOSP'23] Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates
- [HPCA'23] Phloem: Automatic Acceleration of Irregular Applications with Fine-Grain Pipeline Parallelism
- [ACL'23] Sequence Parallelism: Long Sequence Training from System Perspective
- [CCGrid'23] A Deep Learning Pipeline Parallel Optimization Method
- [OSDI'23] MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms
- [ATC'23] Accelerating Distributed MoE Training and Inference with Lina
- [ATC'23] SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization
- [ATC'23] MSRL: Distributed Reinforcement Learning with Dataflow Fragments
- [Survey 🔍] [TPDS'23] A Survey on Auto-Parallelism of Large-Scale Deep Learning Training
- [ICML'23] SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
- [ICML'23] BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models
- [ICS'23] A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
- [NSDI'23] TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
- [NSDI'23] Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
- [NSDI'23] ARK: GPU-driven Code Execution for Distributed Deep Learning
- [SIGMOD'23] FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement
- [MLSys'23] On Optimizing the Communication of Model Parallelism
- [MLSys'23] MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
- [MLSys'23] Tutel: Adaptive Mixture-of-Experts at Scale
- [TPDS'23] Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models
- [PPoPP'23] Elastic Averaging for Efficient Pipelined DNN Training
- [PPoPP'23] Efficient All-Reduce for Distributed DNN Training in Optical Interconnect Systems
- [VLDB'23] MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud
- [VLDB'23] Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism
- [ASPLOS'23] Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers
- [ASPLOS'23] Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression
- [arxiv'22] Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
- [arxiv'22] Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
- [ICPP'22] Tesseract: Parallelize the Tensor Parallelism Efficiently
- [NeurIPS'22] Fine-tuning Language Models over Slow Networks using Activation Quantization with Guarantees
- [SoCC'22] Accelerating Large-Scale Distributed Neural Network Training with SPMD Parallelism
- [MLSys'22] Pathways: Asynchronous distributed dataflow for ML
- [MLSys'22] SRIFTY: Swift and Thrifty Distributed Neural Network Training on the Cloud
- [MLSys'22] Efficient Strong Scaling Through Burst Parallel Training
- [EuroSys'22] Varuna: scalable, low-cost training of massive deep learning models
- [ATC'22] Whale: Efficient Giant Model Training over Heterogeneous GPUs
- [NeurIPS'22] AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness
- [PPoPP'22] FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models
- [ICML'22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
- [ICML'22] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
- [HPDC'22] Hare: Exploiting Inter-job and Intra-job Parallelism of Distributed Machine Learning on Heterogeneous GPUs
- [OSDI'22] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
- [NSDI'22] Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks
- [arxiv'21] Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training
- [arxiv'21] GSPMD: General and Scalable Parallelization for ML Computation Graphs
- [JMLR'21] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- [TPDS'21] TensorOpt: Exploring the Tradeoffs in Distributed DNN Training With Auto-Parallelism
- [ATC'21] Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism
- [SIGMOD'21] Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce [also in 2.10]
- [MLSys'21] PipeMare: Asynchronous Pipeline Parallel DNN Training
- [ICLR'21] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- [NeurIPS'21] Piper: Multidimensional Planner for DNN Parallelization
- [ICML'21] Memory-Efficient Pipeline-Parallel DNN Training
- [ICML'21] TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
- [ICML'21] PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
- [SC'21] Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
- [SC'21] Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (PTD-P or Megatron-LM v2)
- [FAST'21] Behemoth: A Flash-centric Training Accelerator for Extreme-scale DNNs
- [PPoPP'21] DAPPLE: a pipelined data parallel approach for training large models
- [VLDB'21] Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches
- [HPCA'20] AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerators
- [NeurIPS'20] Efficient Algorithms for Device Placement of DNN Graph Operators
- [arxiv'20] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- [KDD'20 Tutorial] DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters
- [VLDB'20] PyTorch Distributed: Experiences on Accelerating Data Parallel Training
- [OSDI'20] A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (BytePS)
- [SOSP'19] PipeDream: Generalized Pipeline Parallelism for DNN Training
- [NeurIPS'20] Language Models are Few-Shot Learners [From OpenAI]
- [arxiv'20] Scaling Laws for Neural Language Models [From OpenAI]
- [HPCA'19] HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array
- [IEEE MICRO'19] Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training
- [MLSys'19] Beyond data and model parallelism for deep neural networks (FlexFlow)
- [MLSys'19] TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
- [EuroSys'19] Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks
- [EuroSys'19] Supporting Very Large Models using Automatic Dataflow Graph Partitioning (Tofu)
- [SOSP'19] A Generic Communication Scheduler for Distributed DNN Training Acceleration
- [NeurIPS'19] Mesh-TensorFlow: Deep Learning for Supercomputers
- [NeurIPS'19] GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
- [ICML'18] Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks
- [Survey 🔍] [IJCAI'22] Survey on Efficient Training of Large Neural Networks
- [Survey 🔍] [ACM CSUR'19] Demystifying Parallel and Distributed Deep Learning
- [Survey 🔍] [ACM CSUR'19] Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques, and Tools
- [arxiv'24] SkyServe: Serving AI Models across Regions and Clouds with Spot Instances
- [arxiv'24] Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight
- [arxiv'24] ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
- [arxiv'24] Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
- [arxiv'24] PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation
- [SOSP'24] ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation
- [HPDC'24] DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models
- [EuroSys'24] Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures
- [NSDI'24] Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
- [arxiv'23] Unicron: Economizing Self-Healing LLM Training at Scale
- [VLDB'23] Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding
- [SOSP'23] GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
- [SOSP'23] Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates
- [NSDI'23] Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
- [EuroSys'22] Varuna: scalable, low-cost training of massive deep learning models
- [ATC'22] Sibylla: To Retry or Not To Retry on Deep Learning Job Failure
- [MLSys'21] Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery
- [FAST'21] CheckFreq: Frequent, Fine-Grained DNN Checkpointing
- [ICSE'20] An Empirical Study on Program Failures of Deep Learning Jobs
- [OSDI'23] Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters
- [NSDI'23] ModelKeeper: Accelerating DNN Training via Automated Training Warmup
- [OSDI'20] Retiarii: A Deep Learning Exploratory-Training Framework
- [SC'24] Optimizing Distributed ML Communication with Fused Computation-Collective Operations
- [SC'24] Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI
- [NeurIPS'24] LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing
- [arxiv'24] LumosCore: Highly Scalable LLM Clusters with Optical Interconnect
- [TPDS'24] AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost
- [HOTI'24] Unified Collective Communication (UCC): An Unified Library for CPU, GPU, and DPU Collectives
- [SC'24] Switch-Less Dragonfly on Wafers: A Scalable Interconnection Architecture based on Wafer-Scale Integration
- [HPDC'24] Near-Optimal Wafer-Scale Reduce
- [arxiv'24] HiCCL: A Hierarchical Collective Communication Library
- [ICS'24] gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters
- [ICS'24] Snoopie: A Multi-GPU Communication Profiler and Visualizer
- [arxiv'24] CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts
- [arxiv'24] Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping
- [arxiv'24] Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects
- [arxiv'24] Demystifying the Communication Characteristics for Distributed Transformer Models
- [ICPP'24] Sparse Gradient Communication with AlltoAll for Accelerating Distributed Deep Learning
- [NAIC @ SIGCOMM'24] Proof-of-Concept of a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation
- [NAIC @ SIGCOMM'24] Eloquent: A More Robust Transmission Scheme for LLM Token Streaming
- [NAIC @ SIGCOMM'24] OmNICCL: Zero-cost Sparse AllReduce with Direct Cache Access and SmartNICs
- [SIGCOMM'24] Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem
- [SIGCOMM'24] RDMA over Ethernet for Distributed Training at Meta Scale
- [SIGCOMM'24] Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs
- [SIGCOMM'24] MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud
- [arxiv'24] MLTCP: Congestion Control for DNN Training
- [arxiv'24] Accelerating Distributed Deep Learning using Lossless Homomorphic Compression
- [arxiv'24] ForestColl: Efficient Collective Communications on Heterogeneous Network Fabrics
- [APNet'24] Understanding Communication Characteristics of Distributed Training
- [ICLR'24] ZeRO++: Extremely Efficient Collective Communication for Large Model Training
- [ICLR'24] CO2: Efficient Distributed Training with Full Communication-Computation Overlap
- [MLSys'24] L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and Accurate Deep Learning
- [MLSys'24] Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping
- [ASPLOS'24] T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
- [ASPLOS'24] TCCL: Discovering Better Communication Paths for PCIe GPU Clusters
- [ASPLOS'24] Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning
- [ASPLOS'24] Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM
- [NSDI'24] THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression
- [Survey 🔍] [arxiv'23] Communication-Efficient Distributed Deep Learning: A Comprehensive Survey
- [arxiv'23] Optimized Network Architectures for Large Language Model Training with Billions of Parameters
- [arxiv'23] FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models
- [arxiv'23] Rethinking Memory and Communication Cost for Efficient Large Language Model Training
- [arxiv'23] Zen: Near-Optimal Sparse Tensor Synchronization for Distributed DNN Training
- [arxiv'23] TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Training
- [INFOCOM'23] Libra: Contention-Aware GPU Thread Allocation for Data Parallel Training in High Speed Networks
- [ICDCS'23] bbTopk: Bandwidth-Aware Sparse Allreduce with Blocked Sparsification for Efficient Distributed Training
- [ICML'23] CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks
  - Related to DT-FM (NeurIPS'22)
- [IPDPS'23] MCR-DL: Mix-and-Match Communication Runtime for Deep Learning
- [ASPLOS'23] MSCCLang: Microsoft Collective Communication Language
- [ASPLOS'23] Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models
- [EuroSys'23] A2TP: Aggregator-aware In-network Aggregation for Multi-tenant Learning
- [EuroSys'23] Hi-Speed DNN Training with Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies
- [MLSys'23] Cupcake: A Compression Optimizer for Scalable Communication-Efficient Distributed Training
- [MLSys'23] On Optimizing the Communication of Model Parallelism
- [NSDI'23] TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
- [NSDI'23] Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using SYNDICATE
- [NSDI'23] TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches
- [EuroSys'22] Out-of-order backprop: an effective scheduling technique for deep learning
- [ISCA'22] Themis: a network bandwidth-aware collective scheduling policy for distributed training of DL models
- [SC'22] HammingMesh: A Network Topology for Large-Scale Deep Learning
- [PPoPP'22] Near-optimal sparse allreduce for distributed deep learning
- [MLSys'22] Synthesizing optimal parallelism placement and reduction strategies on hierarchical systems for deep learning (P^2)
- [ASPLOS'22] Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads (CoCoNET)
- [EuroSys'21] DGCL: an efficient communication library for distributed GNN training
- [ICLR'21] Multi-Level Local SGD for Heterogeneous Hierarchical Networks
- [SIGMOD'21] Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce [also in 2.5]
- [SC'21] Flare: flexible in-network allreduce
- [NSDI'21] Scaling Distributed Machine Learning with In-Network Aggregation
- [ISCA'21] Enabling compute-communication overlap in distributed deep learning training platforms
- [PPoPP'21] Synthesizing optimal collective algorithms (SCCL)
- [SIGCOMM'21] SiP-ML: High-Bandwidth Optical Network Interconnects for Machine Learning Training
- [ISCA'20] An in-network architecture for accelerating shared-memory multiprocessor collectives
- [NeurIPS'20] Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning
- [PPoPP'20] Taming unbalanced training workloads in deep learning with partial collective operations
- [MLSys'20] Blink: Fast and Generic Collectives for Distributed ML
- [MLSys'20] PLink: Discovering and Exploiting Datacenter Network Locality for Efficient Cloud-based Distributed Training
- [OSDI'20] A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (BytePS)
- [MLSys'19] Priority-based Parameter Propagation for Distributed DNN Training (P3)
- [MLSys'19] TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
- [SOSP'19] A generic communication scheduler for distributed DNN training acceleration (ByteScheduler)
- [ATC'17] Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
- [SOSP'24] Scaling Deep Learning Computation over the Inter-core Connected Intelligence Processor with T10
- [OSDI'23] Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning
- [OSDI'23] Welder: Scheduling Deep Learning Memory Access via Tile-graph
- [OSDI'23] Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators
- [OSDI'23] EINNET: Optimizing Tensor Programs with Derivation-Based Transformations
- [OSDI'23] Optimizing Dynamic Neural Networks with Brainstorm
- [OSDI'22] ROLLER: Fast and Efficient Tensor Compilation for Deep Learning
- [OSDI'20] Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks
- [OSDI'20] Ansor: Generating High-Performance Tensor Programs for Deep Learning
- [ASPLOS'20] FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System
- [OSDI'18] TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
- [ACL'23] Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
- [ICLR'23] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- [OSDI'23] AdaEmbed: Adaptive Embedding for Large-Scale Recommendation Models
- [ICML'22] TSPipe: Learn from Teacher Faster with Pipelines
For a comprehensive list of GNN systems papers, refer to https://github.com/chwan1016/awesome-gnn-systems.
- [arxiv'24] FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale
- [ICPP'24] GNNDrive: Reducing Memory Contention and I/O Congestion for Disk-based GNN Training
- [VLDB'24] NeutronStream: A Dynamic GNN Training Framework with Sliding Window for Graph Streams
- [arxiv'23] ReFresh: Reducing Memory Access from Exploiting Stable Historical Embeddings for Graph Neural Network Training
- [arxiv'23] Helios: An Efficient Out-of-core GNN Training System on Terabyte-scale Graphs with In-memory Performance
- [arxiv'23] GNNPipe: Accelerating Distributed Full-Graph GNN Training with Pipelined Model Parallelism
- [MLSys'23] Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training
- [SIGMOD'23] DUCATI: A Dual-Cache Training System for Graph Neural Networks on Giant Graphs with the GPU
- [OSDI'23] MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms
- [EuroSys'23] MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks
- [KDD'22] Distributed Hybrid CPU and GPU training for Graph Neural Networks on Billion-Scale Heterogeneous Graphs
- [VLDB'22] TGL: a general framework for temporal GNN training on billion-scale graphs
- [OSDI'21] P3: Distributed Deep Graph Learning at Scale
- [arxiv'24] MLTCP: Congestion Control for DNN Training
- [HotNets'22] Congestion Control in Machine Learning Clusters
- [arxiv'24] SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
- [arxiv'24] SageAttention2 Technical Report: Accurate 4-Bit Attention for Plug-and-play Inference Acceleration
- [arxiv'24] Optimizing Speculative Decoding for Serving Large Language Models Using Goodput
- [ACL'24] SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget
- [arxiv'24] EcoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving
- [IPDPS'24] Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
- [arxiv'24] EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
- [NeurIPS'24] Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting
- [NeurIPS'24] Toward Efficient Inference for Mixture of Experts
- [arxiv'24] Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection
- [SC'24] PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
- [NeurIPS'24 (spotlight)] Sequoia: Scalable and Robust Speculative Decoding
- [SC'24] SMIless: Serving DAG-based Inference with Dynamic Invocations under Serverless Computing
- [arxiv'24] SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference
- [arxiv'24] V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM
- [SenSys'24] LiteMoE: Customizing On-device LLM Serving via Proxy Submodel Tuning
- [arxiv'24] HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
- [arxiv'24] NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
- [arxiv'24] SkyServe: Serving AI Models across Regions and Clouds with Spot Instances
- [MICRO'24] Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs
- [arxiv'24] VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
- [arxiv'24] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
- [arxiv'24] Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs
- [arxiv'24] POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
- [arxiv'24] Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
- [arxiv'24] MagicPIG: LSH Sampling for Efficient LLM Generation
- [arxiv'24] Revisiting SLO and Goodput Metrics in LLM Serving
- [arxiv'24] EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models
- [arxiv'24] ParallelSpec: Parallel Drafter for Efficient Speculative Decoding
- [EuroSys'25] Fast State Restoration in LLM Serving with HCache
- [arxiv'24] SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
- [arxiv'24] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
- [arxiv'24] vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
- [arxiv'24] Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
- [arxiv'24] CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts
- [arxiv'24] DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
- [HPCA'24] KRISP: Enabling Kernel-wise RIght-sizing for Spatial Partitioned GPU Inference Servers
- [arxiv'24] Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference
- [arxiv'24] Efficient LLM Scheduling by Learning to Rank
- [arxiv'24] P/D-Serve: Serving Disaggregated Large Language Model at Scale
- [arxiv'24] NanoFlow: Towards Optimal Large Language Model Serving Throughput
- [arxiv'24] MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
- [SOSP'24] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
- [SOSP'24] LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
- [SOSP'24] Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation
- [SOSP'24] Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
- [arxiv'24] LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale
- [ICPP'24] GMM: An Efficient GPU Memory Management-based Model Serving System for Multiple DNN Inference Models
- [SIGCOMM'24] CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
- [ES-FoMO @ ICML'24] CO2: Precise Attention Score Observation for improving KV Cache Replacement in Large Language Models
- [OSDI'24] dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving
- [OSDI'24] Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
- [OSDI'24] USHER: Holistic Interference Avoidance for Resource Optimized ML Inference
- [OSDI'24] Fairness in Serving Large Language Models
- [OSDI'24] MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures
- [OSDI'24] Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
- [OSDI'24] ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
- [OSDI'24] InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
- [OSDI'24] Llumnix: Dynamic Scheduling for Large Language Model Serving
- [OSDI'24] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- [ATC'24] Power-aware Deep Learning Model Serving with μ-Serve
- [ATC'24] Fast Inference for Probabilistic Graphical Models
- [ATC'24] Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention
- [ATC'24] PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch
- [ATC'24] Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs
- [TPDS'24] ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG
- [Survey 🔍] [arxiv'24] LLM Inference Serving: Survey of Recent Advances and Opportunities
- [arxiv'24] Metron: Holistic Performance Evaluation Framework for LLM Inference Systems
- [arxiv'24] Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead
- [arxiv'24] One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving
- [arxiv'24] MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
- [ISCA'24] ElasticRec: A Microservice-based Model Serving Architecture Enabling Elastic Resource Scaling for Recommendation Models
- [ISCA'24] Splitwise: Efficient generative LLM inference using phase splitting
- [ICML'24] HexGen: Generative Inference of Large Language Model over Heterogeneous Environment
- [ICML'24] MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving
- [HPCA'24] An LPDDR-based CXL-PNM Platform for TCO-efficient Inference of Transformer-based Large Language Models
- [arxiv'24] Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs
- [MobiSys'24] ARISE: High-Capacity AR Offloading Inference Serving via Proactive Scheduling
- [MobiSys'24] Pantheon: Preemptible Multi-DNN Inference on Mobile Edge GPUs
- [arxiv'24] Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference
- [arxiv'24] HawkVision: Low-Latency Modeless Edge AI Serving
- [MLSys'24] HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices
- [MLSys'24] S-LoRA: Serving Thousands of Concurrent LoRA Adapters
- [MLSys'24] Vidur: A Large-Scale Simulation Framework For LLM Inference
- [arxiv'24] The CAP Principle for LLM Serving
- [WWW'24] λGrapher: A Resource-Efficient Serverless System for GNN Serving through Graph Sharing
- [ICML'24] CLLMs: Consistency Large Language Models
- [arxiv'24] BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
- [EuroSys'24] Model Selection for Latency-Critical Inference Serving
- [arxiv'24] Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity
- [arxiv'24] Learn To be Efficient: Build Structured Sparsity in Large Language Models
- [arxiv'24] Sponge: Inference Serving with Dynamic SLOs Using In-Place Vertical Scaling
- [ISCA'24] Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
- [arxiv'24] Minions: Accelerating Large Language Model Inference with Adaptive and Collective Speculative Decoding
- [arxiv'24] ALTO: An Efficient Network Orchestrator for Compound AI Systems
- [ASPLOS'24] ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference
- [ASPLOS'24] NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing
- [arxiv'24] ATP: Enabling Fast LLM Serving via Attention on Top Principal Keys
- [arxiv'24] Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
- [ICML'24] DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
- [ICLR'24] Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
- [arxiv'24] FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning
- [arxiv'24] Wisdom of Committee: Distilling from Foundation Model to Specialized Application Model
- [arxiv'24] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
- [arxiv'24] LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
- [NSDI'24] Approximate Caching for Efficiently Serving Diffusion Models
- [arxiv'24] APIServe: Efficient API Support for Large-Language Model Inferencing
- [arxiv'24] ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models
- [arxiv'24] MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving
- [arxiv'24] FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
- [arxiv'24] Accelerating Retrieval-Augmented Language Model Serving with Speculation
- [arxiv'24] CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
- [arxiv'24] Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
- [arxiv'24] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
- [Survey 🔍] [arxiv'24] Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
- [arxiv'24] Learned Best-Effort LLM Serving
- [arxiv'24] Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- [VLDB'24] Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
- [ASPLOS'24] SpotServe: Serving Generative Large Language Models on Preemptible Instances
- [ASPLOS'24] SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification
- [arxiv'23] DeltaZip: Multi-Tenant Language Model Serving via Delta Compression
- [EMNLP'23] Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding
- [arxiv'23] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding
- [arxiv'23] Fairness in Serving Large Language Models
- [arxiv'23] Moirai: Towards Optimal Placement for Distributed Inference on Heterogeneous Devices
- [arxiv'23] Punica: Multi-Tenant LoRA Serving
- [arxiv'23] Pipeline Parallelism for DNN Inference with Practical Performance Guarantees
- [arxiv'23] SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
- [arxiv'23] High-throughput Generative Inference of Large Language Models with a Single GPU
- [HPDC'23] Kairos: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources
- [SOSP'23] Paella: Low-latency Model Serving with Virtualized GPU Scheduling
- [SOSP'23] Efficient Memory Management for Large Language Model Serving with PagedAttention
- [MLSys'23] Efficiently Scaling Transformer Inference
- [EuroSys'23] Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access
- [EuroSys'23] Tabi: An Efficient Multi-Level Inference System for Large Language Models
- [EuroSys'23] Pocket: ML Serving from the Edge
- [OSDI'23] AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
- [NSDI'23] SHEPHERD: Serving DNNs in the Wild
- [VLDB'23] Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures
- [ICML'23] Fast Inference from Transformers via Speculative Decoding
- [SIGMOD'22] Serverless Data Science - Are We There Yet? A Case Study of Model Serving
- [OSDI'22] Orca: A Distributed Serving System for Transformer-Based Generative Models
- [OSDI'22] Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences
- [ATC'22] SOTER: Guarding Black-box Inference for General Neural Networks at the Edge
- [ATC'22] Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing
- [ATC'22] Tetris: Memory-efficient Serverless Inference through Tensor Sharing
- [ATC'22] PetS: A Unified Framework for Parameter-Efficient Transformers Serving
- [ATC'21] INFaaS: Automated Model-less Inference Serving
- [SoCC'21] Morphling: Fast, Near-Optimal Auto-Configuration for Cloud-Native Model Serving
- [arxiv'21] Supporting Massive DLRM Inference through Software Defined Memory
- [MobiCom'20] SPINN: Synergistic Progressive Inference of Neural Networks over Device and Cloud
This is the list of papers about MoE training and inference (collected from 2.6 and 3).
- [ML for Sys workshop @ NeurIPS'24] IFMoE: An Inference Framework Design for Fine-grained MoE
- [ML for Sys workshop @ NeurIPS'24] TurboMoE: Enhancing MoE Model Training with Smart Kernel-Fusion and Data Transformation
- [arxiv'24] Dense Backpropagation Improves Routing for Sparsely-Gated Mixture-of-Experts
- [arxiv'24] MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
- [MLSys'24] SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
- [arxiv'24] Pro-Prophet: Systematic Load Balancing Method for Efficient Parallel Training of Large-scale MoE Models
- [EMNLP'24] Mixture of Diverse Size Experts
- [ACL'24] SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget
- [SoCC'24] MoEsaic: Shared Mixture of Experts
- [KDD'24] Efficient Mixture of Experts based on Large Language Models for Low-Resource Data Preprocessing
- [arxiv'24] Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism
- [IPDPS'24] Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
- [arxiv'24] EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
- [arxiv'24] Shortcut-connected Expert Parallelism for Accelerating Mixture of Experts
- [NeurIPS'24] Toward Efficient Inference for Mixture of Experts
- [arxiv'24] Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection
- [MLSys'24] SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
- [SC'24] APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes
- [NeurIPS'24] GraphMETRO: Mitigating Complex Graph Distribution Shifts via Mixture of Aligned Experts
- [arxiv'24] HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
- [arxiv'24] Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
- [NeurIPS'24] LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing
- [arxiv'24] Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
- [NeurIPS'24] Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design
- [arxiv'24] ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference
- [arxiv'24] Demystifying the Compression of Mixture-of-Experts Through a Unified Framework
- [arxiv'24] Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
- [arxiv'24] Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling
- [arxiv'24] MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router
- [arxiv'24] Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models
- [arxiv'24] MoH: Multi-Head Attention as Mixture-of-Head Attention
- [arxiv'24] AT-MoE: Adaptive Task-planning Mixture of Experts via LoRA Approach
- [NeurIPS'24 (spotlight)] Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts
- [arxiv'24] Aria: An Open Multimodal Native Mixture-of-Experts Model
- [arxiv'24] MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More
- [arxiv'24] MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts
- [arxiv'24] Upcycling Large Language Models into Mixture of Experts
- [arxiv'24] No Need to Talk: Asynchronous Mixture of Language Models
- [arxiv'24] Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
- [arxiv'24] HMoE: Heterogeneous Mixture of Experts for Language Modeling
- [arxiv'24] FedMoE: Personalized Federated Learning via Heterogeneous Mixture of Experts
- [arxiv'24] AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies
- [arxiv'24] Layerwise Recurrent Router for Mixture-of-Experts
- [arxiv'24] Partial Experts Checkpoint: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training
- [SRW @ ACL'24] MoExtend: Tuning New Experts for Modality and Task Extension
- [arxiv'24] MoDE: Effective Multi-task Parameter Efficient Fine-Tuning with a Mixture of Dyadic Experts
- [arxiv'24] Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs
- [arxiv'24] Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts
- [arxiv'24] Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
- [ICML'24] Scaling Laws for Fine-Grained Mixture of Experts
- [ICML'24] Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training
- [MLSys'24] QMoE: Sub-1-Bit Compression of Trillion-Parameter Models
- [MLSys'24] Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping
- [arxiv'24] CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
- [arxiv'24] AdaMoLE: Fine-Tuning Large Language Models with Adaptive Mixture of Low-Rank Adaptation Experts
- [SIGIR'24] M3oE: Multi-Domain Multi-Task Mixture-of Experts Recommendation Framework
- [EuroSys'24] ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling
- [arxiv'24] MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA based Mixture of Experts
- [ICLR'24] Mixture of LoRA Experts
- [arxiv'24] Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
- [arxiv'24] MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving
- [IJCAI'24] LocMoE: A Low-overhead MoE for Large Language Model Training
- [ISCA'24] Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
- [EMNLP'23] Adaptive Gating in Mixture-of-Experts based Language Models
- [ACL'23] AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation
- [arxiv'23] Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
- [ATC'23] Accelerating Distributed MoE Training and Inference with Lina
- [ATC'23] SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization
- [SIGMOD'23] FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement
- [ICS'23] A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
- [MLSys'23] MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
- [MLSys'23] Tutel: Adaptive Mixture-of-Experts at Scale
- [PPoPP'22] FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models
- [SustaiNLP @ EMNLP'22] Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production
- [NeurIPS'22] Mixture-of-Experts with Expert Choice Routing
- [ICML'22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
- [ICML'22] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
- [JMLR'22] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- [ICLR'17] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- [SOSP'24] LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
- [arxiv'24] USP: A Unified Sequence Parallelism Approach for Long Context Generative AI
- [arxiv'24] Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
- [NeurIPS'24 Workshop] Long Context RAG Performance of Large Language Models
- [arxiv'24] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
- [arxiv'24] Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
- [arxiv'24] CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts
- [COLM'24] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
- [arxiv'24] FocusLLM: Scaling LLM's Context by Parallel Decoding
- [Survey 🔍] [IJCAI'24] X-former Elucidator: Reviving Efficient Attention for Long Context Language Modeling
- [arxiv'24] FedMoE: Personalized Federated Learning via Heterogeneous Mixture of Experts
- [MLSys'24] LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning
- [arxiv'24] FedEx: Expediting Federated Learning over Heterogeneous Mobile Devices by Overlapping and Participant Selection
- [KDD'24] FedBiOT: LLM Local Fine-tuning in Federated Learning without Full Model
- [CCGrid'24] Apodotiko: Enabling Efficient Serverless Federated Learning in Heterogeneous Environments
- [EuroSys'24] Dordis: Efficient Federated Learning with Dropout-Resilient Differential Privacy
- [arxiv'24] Decoupled Vertical Federated Learning for Practical Training on Vertically Partitioned Data
- [SAC'24] Training Heterogeneous Client Models using Knowledge Distillation in Serverless Federated Learning
- [arxiv'23] CAFE: Carbon-Aware Federated Learning in Geographically Distributed Data Centers
- [arxiv'23] Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization
- [IMWUT'23] AttFL: A Personalized Federated Learning Framework for Time-series Mobile and Embedded Sensor Data Processing
- [Survey 🔍] [FGCS'23] Model aggregation techniques in federated learning: A comprehensive survey
- [SoCC'23] Auxo: Heterogeneity-Mitigating Federated Learning via Scalable Client Clustering
- [MLSys'23] GlueFL: Reconciling Client Sampling and Model Masking for Bandwidth Efficient Federated Learning
- [WWW'23] To Store or Not? Online Data Selection for Federated Learning with Limited Storage
- [EuroSys'23] REFL: Resource-Efficient Federated Learning
- [VLDB'23] FederatedScope: A Flexible Federated Learning Platform for Heterogeneity
- [RecSys'22] Towards Fair Federated Recommendation Learning: Characterizing the Inter-Dependence of System and Data Heterogeneity
- [TMLR'22] Optimal Client Sampling for Federated Learning
- [ICML'22] FedScale: Benchmarking Model and System Performance of Federated Learning at Scale
- [MobiSys'22] FedBalancer: data and pace control for efficient federated learning on heterogeneous clients
- [MobiCom'22] PyramidFL: A Fine-grained Client Selection Framework for Efficient Federated Learning
- [MLSys'22] PAPAYA: Practical, Private, and Scalable Federated Learning
- [AISTATS'22] Federated Learning with Buffered Asynchronous Aggregation
- [NeurIPS'21] Federated Reconstruction: Partially Local Federated Learning
- [NeurIPS'21] FjORD: Fair and Accurate Federated Learning under heterogeneous targets with Ordered Dropout
- [OSDI'21] Oort: Efficient Federated Learning via Guided Participant Selection
- [MICRO'21] AutoFL: Enabling Heterogeneity-Aware Energy Efficient Federated Learning
- [MLSys'19] Towards Federated Learning at Scale: System Design
- [Survey 🔍] [ACM CSUR'22] Federated Learning for Smart Healthcare: A Survey
- [DAC'23] Privacy-Preserving DNN Training with Prefetched Meta-Keys on Heterogeneous Neural Network Accelerators
- [ICLR'23] MPCFormer: fast, performant and private Transformer inference with MPC
- [NeurIPS'22] Iron: Private Inference on Transformers
- [arxiv'24] APIServe: Efficient API Support for Large-Language Model Inferencing
- [OSDI'24] ChameleonAPI: Automatic and Efficient Customization of Neural Networks for ML Applications
- [ICML'22] Efficient Online ML API Selection for Multi-Label Classification Tasks (FrugalMCT)
- [NeurIPS'20] FrugalML: How to use ML Prediction APIs more accurately and cheaply
- [ICSE'25] Large Language Models as Configuration Validators
- [NeurIPS'24] IaC-Eval: A code generation benchmark for Infrastructure-as-Code programs
- [arxiv'24] Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight
- [arxiv'24] LLMTune: Accelerate Database Knob Tuning with Large Language Models
- [SIGCOMM'24] NetLLM: Adapting Large Language Models for Networking
- [arxiv'24] LLM-Enhanced Data Management
- [arxiv'24] MPIrigen: MPI Code Generation through Domain-Specific Language Models
- [arxiv'24] Can Large Language Models Write Parallel Code?
- [arxiv'23] LLM-Assisted Code Cleaning For Training Accurate Code Generators
- [arxiv'23] Large Language Models for Compiler Optimization
- [VLDB'23] How Large Language Models Will Disrupt Data Management
- [arxiv'24] ACS: Concurrent Kernel Execution on Irregular, Input-Dependent Computational Graphs
- [RTAS'24] Demystifying NVIDIA GPU Internals to Enable Reliable GPU Management
- [OSDI'23] Welder: Scheduling Deep Learning Memory Access via Tile-graph
- [arxiv'21] Characterizing Concurrency Mechanisms for NVIDIA GPUs under Deep Learning Workloads
- [SIGMETRICS'21] Demystifying the Placement Policies of the NVIDIA GPU Thread Block Scheduler for Concurrent Kernels
- [NeurIPS'20] Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning
- [RTSS'17] GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed
- [arxiv'24] DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
- [SOSP'24] Perseus: Removing Energy Bloat from Large Model Training
- [arxiv'23] CAFE: Carbon-Aware Federated Learning in Geographically Distributed Data Centers
- [ATC'23] EnvPipe: Performance-preserving DNN Training Framework for Saving Energy
- [NSDI'23] Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training
- [CPAL'24 (PMLR)] Jaxpruner: A Concise Library for Sparsity Research
- [arxiv'24] Scorch: A Library for Sparse Deep Learning
- [arxiv'24] Drowning in Documents: Consequences of Scaling Reranker Inference
- [arxiv'24] Crafting Interpretable Embeddings for Language Neuroscience by Asking LLMs Questions
- [arxiv'24] Computational Bottlenecks of Training Small-scale Large Language Models
- [Survey 🔍] [arxiv'24] A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness
- [arxiv'24] AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution
- [ASPLOS'25 (to appear)] PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption
- [NeurIPS'24] Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems
- [NeurIPS'24 Workshop] Long Context RAG Performance of Large Language Models
- [arxiv'24] Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment
- [arxiv'24] DroidSpeak: Enhancing Cross-LLM Communication
- [arxiv'24] Disaggregating Embedding Recommendation Systems with FlexEMR
- [arxiv'24] JudgeBench: A Benchmark for Evaluating LLM-based Judges
- [VLDB'25] Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models
- [arxiv'24] You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation
- [arxiv'24] Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native
- [Survey 🔍] [arxiv'24] A Survey of Resource-efficient LLM and Multimodal Foundation Models
- [arxiv'23] Efficiently Programming Large Language Models using SGLang
- [MICRO'23] Path Forward Beyond Simulators: Fast and Accurate GPU Execution Time Prediction for DNN Workloads
This repository is motivated by:
- https://github.com/HuaizhengZhang/Awesome-System-for-Machine-Learning
- https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers
- https://github.com/ganler/ResearchReading
- https://jeongseob.github.io/readings_mlsys.html
- https://github.com/chwan1016/awesome-gnn-systems
- https://github.com/ConnollyLeon/awesome-Auto-Parallelism