
Paper List for Machine Learning Systems

A paper list covering a broad range of topics in machine learning systems.

NOTE: Survey papers are annotated with the [Survey 🔍] prefix.

Table of Contents

1. Data Processing

1.1 Data pipeline optimization

1.1.1 General

1.1.2 Prep stalls

1.1.3 Fetch stalls (I/O)

1.1.4 Specific workloads (GNN, DLRM)

1.2 Caching and Distributed storage for ML training

1.3 Data formats

  • [ECCV'22] L3: Accelerator-Friendly Lossless Image Format for High-Resolution, High-Throughput DNN Training
  • [VLDB'21] Progressive compressed records: Taking a byte out of deep learning data

1.4 Data pipeline fairness and correctness

  • [CIDR'21] Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines

1.5 Data labeling automation

  • [VLDB'18] Snorkel: Rapid Training Data Creation with Weak Supervision

2. Training System

2.1 Empirical study on ML Jobs

  • [ICSE'24] An Empirical Study on Low GPU Utilization of Deep Learning Jobs
  • [NSDI'24] Characterization of Large Language Model Development in the Datacenter
  • [NSDI'22] MLaaS in the wild: workload analysis and scheduling in large-scale heterogeneous GPU clusters (PAI)
  • [ATC'19] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (Philly)

2.2 DNN job scheduling

2.3 GPU sharing

2.4 GPU memory management and optimization

2.5 GPU memory usage estimate

  • [ESEC/FSE'20] Estimating GPU memory consumption of deep learning models

2.6 Distributed training (Parallelism)

2.7 DL job failures / Fault tolerance (resilient training)

2.8 AutoML

  • [OSDI'23] Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters
  • [NSDI'23] ModelKeeper: Accelerating DNN Training via Automated Training Warmup
  • [OSDI'20] Retiarii: A Deep Learning Exploratory-Training Framework

2.9 Communication optimization & Network Infrastructure for ML

2.10 DNN compiler

2.11 Model pruning and compression

2.12 GNN training system

For a comprehensive list of GNN systems papers, refer to https://github.com/chwan1016/awesome-gnn-systems.

2.13 Congestion control for DNN training

2.14 Others

3. Inference System

4. Mixture of Experts (MoE)

This section lists papers on MoE training and inference (collected from Sections 2.6 and 3).

5. LLM Long Context

6. Federated Learning

7. Privacy-Preserving ML

8. ML APIs & Application-side Optimization

9. ML (LLM) for Systems

10. GPU kernel scheduling

11. Energy efficiency for LLM (carbon-aware)

Others

References

This repository is motivated by: