Marco's SysML reading list

A curated reading list of computer science research for work at the intersection of machine learning and systems. PRs are welcome.

Review

A Berkeley View of Systems Challenges for AI https://arxiv.org/pdf/1712.05855.pdf

Strategies and Principles of Distributed Machine Learning on Big Data https://arxiv.org/abs/1512.09295

Background

Deep learning Nature volume 521, 2015 https://www.nature.com/articles/nature14539

Deep learning reading list http://deeplearning.net/reading-list

Measurement

Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications https://www.microsoft.com/en-us/research/uploads/prod/2018/05/gpu_sched_tr.pdf

Frameworks

TensorFlow: A System for Large-Scale Machine Learning OSDI 2016 https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf

Ray: A Distributed Framework for Emerging AI Applications OSDI 2018 https://www.usenix.org/system/files/osdi18-moritz.pdf

Tuning

HyperDrive: Exploring Hyperparameters with POP Scheduling Middleware 2017 https://dl.acm.org/citation.cfm?id=3135994

Ease.ml: Towards Multi-tenant Resource Sharing for Machine Learning Workloads VLDB 2018 http://www.vldb.org/pvldb/vol11/p607-li.pdf

Automating Model Search for Large Scale Machine Learning SoCC 2015 http://dl.acm.org/authorize?N91362

Google Vizier: A Service for Black-Box Optimization KDD 2017 https://dl.acm.org/citation.cfm?id=3098043

Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization Journal of Machine Learning Research 18 (2018) https://arxiv.org/pdf/1603.06560.pdf

Hyperopt: a Python library for model selection and hyperparameter optimization Computational Science & Discovery, 8(1) 2015 http://iopscience.iop.org/article/10.1088/1749-4699/8/1/014008

Auto-Keras: Efficient Neural Architecture Search with Network Morphism https://arxiv.org/pdf/1806.10282v2.pdf

Runtime execution

Cavs: An Efficient Runtime System for Dynamic Neural Networks ATC 2018 https://www.usenix.org/system/files/conference/atc18/atc18-xu-shizhen.pdf

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning OSDI 2018 https://www.usenix.org/system/files/osdi18-chen.pdf

PipeDream: Fast and Efficient Pipeline Parallel DNN Training https://arxiv.org/pdf/1806.03377.pdf

STRADS: A Distributed Framework for Scheduled Model Parallel Machine Learning EuroSys 2016 https://dl.acm.org/citation.cfm?id=2901331

Dynamic Control Flow in Large-Scale Machine Learning EuroSys 2018 https://dl.acm.org/citation.cfm?id=3190551

Improving the Expressiveness of Deep Learning Frameworks with Recursion EuroSys 2018 https://dl.acm.org/citation.cfm?id=3190530

Continuum: A Platform for Cost-Aware, Low-Latency Continual Learning SoCC 2018 https://dl.acm.org/citation.cfm?id=3267817

KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics ICDE 2017 https://amplab.cs.berkeley.edu/wp-content/uploads/2017/01/ICDE_2017_CameraReady_475.pdf

Owl: A General-Purpose Numerical Library in OCaml https://arxiv.org/pdf/1707.09616.pdf

Distributed learning

Large Scale Distributed Deep Networks NIPS 2012 https://ai.google/research/pubs/pub40565.pdf

Managed Communication and Consistency for Fast Data-Parallel Iterative Analytics SoCC 2015 http://dl.acm.org/authorize?N91363

Ako: Decentralised Deep Learning with Partial Gradient Exchange SoCC 2016 https://lsds.doc.ic.ac.uk/sites/default/files/ako-socc16.pdf

Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters ATC 2017 https://www.usenix.org/system/files/conference/atc17/atc17-zhang.pdf

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training SoCC 2018 https://dl.acm.org/citation.cfm?id=3267840

MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems ML Systems Workshop at NIPS 2015 https://arxiv.org/pdf/1512.01274.pdf

Scaling Distributed Machine Learning with the Parameter Server OSDI 2014 https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf

Project Adam: Building an Efficient and Scalable Deep Learning Training System OSDI 2014 https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf

Orpheus: Efficient Distributed Machine Learning via System and Algorithm Co-design SoCC 2018 https://dl.acm.org/citation.cfm?id=3267810

Petuum: A New Platform for Distributed Machine Learning on Big Data KDD 2015 https://arxiv.org/pdf/1312.7651.pdf

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism https://arxiv.org/pdf/1811.06965.pdf

Serving systems and inference

DeepCPU: Serving RNN-based Deep Learning Models 10x Faster ATC 2018 https://www.usenix.org/system/files/conference/atc18/atc18-zhang-minjia.pdf

Clipper: A Low-Latency Online Prediction Serving System NSDI 2017 https://www.usenix.org/system/files/conference/nsdi17/nsdi17-crankshaw.pdf

Research for Practice: Prediction-Serving Systems ACM Queue 16(1), 2018 https://queue.acm.org/detail.cfm?id=3210557

InferLine: ML Inference Pipeline Composition https://arxiv.org/pdf/1812.01776.pdf

PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems OSDI 2018 https://www.usenix.org/system/files/osdi18-lee.pdf

Olympian: Scheduling GPU Usage in a Deep Neural Network Model Serving System Middleware 2018 https://dl.acm.org/citation.cfm?id=3274813

Low Latency RNN Inference with Cellular Batching EuroSys 2018 https://dl.acm.org/citation.cfm?id=3190541

SERF: Efficient Scheduling for Fast Deep Neural Network Serving via Judicious Parallelism SC 2016 https://ieeexplore.ieee.org/document/7877104

NoScope: Optimizing Neural Network Queries over Video at Scale VLDB 2017 https://dl.acm.org/citation.cfm?id=3137664

Scheduling

Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters EuroSys 2018 https://dl.acm.org/citation.cfm?id=3190517

SLAQ: Quality-Driven Scheduling for Distributed Machine Learning SoCC 2017 https://dl.acm.org/authorize?N46878

Proteus: agile ML elasticity through tiered reliability in dynamic resource markets EuroSys 2017 https://dl.acm.org/citation.cfm?id=3064182

Gandiva: Introspective Cluster Scheduling for Deep Learning OSDI 2018 https://www.usenix.org/system/files/osdi18-xiao.pdf

Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments SC 2017 https://dl.acm.org/citation.cfm?id=3126933

Algorithmic aspects in scalable ML

Hemingway: Modeling Distributed Optimization Algorithms ML Systems Workshop at NIPS 2016 https://arxiv.org/pdf/1702.05865.pdf

Asynchronous Methods for Deep Reinforcement Learning ICML 2016 http://proceedings.mlr.press/v48/mniha16.pdf

Don't Use Large Mini-Batches, Use Local SGD https://arxiv.org/pdf/1808.07217.pdf

GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server EuroSys 2016 https://dl.acm.org/citation.cfm?id=2901323

ImageNet Training in Minutes ICPP 2018 https://dl.acm.org/citation.cfm?id=3225069

Semantics-Preserving Parallelization of Stochastic Gradient Descent IPDPS 2018 https://ieeexplore.ieee.org/abstract/document/8425176

HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent NIPS 2011 https://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.pdf

QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding NIPS 2017 https://papers.nips.cc/paper/6768-qsgd-communication-efficient-sgd-via-gradient-quantization-and-encoding.pdf

Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent NIPS 2017 https://papers.nips.cc/paper/7117-can-decentralized-algorithms-outperform-centralized-algorithms-a-case-study-for-decentralized-parallel-stochastic-gradient-descent.pdf

Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD AISTATS 2018 https://arxiv.org/pdf/1803.01113.pdf

Probabilistic Synchronous Parallel https://arxiv.org/pdf/1709.07772.pdf

AI Testing and Verification

DeepXplore: Automated Whitebox Testing of Deep Learning Systems SOSP 2017 https://dl.acm.org/authorize?N47145

Programmatically Interpretable Reinforcement Learning ICML 2018 https://arxiv.org/pdf/1804.02477.pdf

AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation SP 2018 https://ieeexplore.ieee.org/document/8418593

Interpretability and Explainability

“Why Should I Trust You?”: Explaining the Predictions of Any Classifier KDD 2016 https://www.kdd.org/kdd2016/papers/files/rfp0573-ribeiroA.pdf

Learning to Explain: An Information-Theoretic Perspective on Model Interpretation ICML 2018 https://arxiv.org/pdf/1802.07814.pdf

A Unified Approach to Interpreting Model Predictions NIPS 2017 https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf

The Mythos of Model Interpretability WHI 2016 https://arxiv.org/pdf/1606.03490.pdf

Model Management

MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis SIGMOD 2018 https://dl.acm.org/citation.cfm?id=3196934

MODELDB: A System for Machine Learning Model Management HILDA 2016 https://mitdbg.github.io/modeldb/papers/hilda_modeldb.pdf

Model Governance: Reducing the Anarchy of Production ML ATC 2018 https://www.usenix.org/system/files/conference/atc18/atc18-sridhar.pdf

The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox CIDR 2015 http://www.bailis.org/papers/velox-cidr2015.pdf

Bandana: Using Non-volatile Memory for Storing Deep Learning Models SysML 2019 https://arxiv.org/abs/1811.05922

Hardware

Deep learning with limited numerical precision ICML 2015 http://proceedings.mlr.press/v37/gupta15.pdf

In-Datacenter Performance Analysis of a Tensor Processing Unit ISCA 2017 https://dl.acm.org/citation.cfm?id=3080246

Serving DNNs in Real Time at Datacenter Scale with Project Brainwave IEEE MICRO 38(2), Mar./Apr. 2018 https://ieeexplore.ieee.org/document/8344479

Security aspects

Efficient Deep Learning on Multi-Source Private Data https://arxiv.org/pdf/1807.06689.pdf

Chiron: Privacy-preserving Machine Learning as a Service https://arxiv.org/pdf/1803.05961.pdf

MLCapsule: Guarded Offline Deployment of Machine Learning as a Service https://arxiv.org/pdf/1808.00590.pdf

Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware https://arxiv.org/pdf/1806.03287.pdf

Privado: Practical and Secure DNN Inference https://arxiv.org/pdf/1810.00602.pdf

ML Platforms (Applied)

Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective HPCA 2018 https://research.fb.com/publications/applied-machine-learning-at-facebook-a-datacenter-infrastructure-perspective/

Machine Learning at Facebook: Understanding Inference at the Edge HPCA 2019 https://research.fb.com/publications/machine-learning-at-facebook-understanding-inference-at-the-edge/

Meet Michelangelo: Uber’s Machine Learning Platform https://eng.uber.com/michelangelo/

Introducing FBLearner Flow: Facebook’s AI backbone https://code.fb.com/core-data/introducing-fblearner-flow-facebook-s-ai-backbone/

TFX: A TensorFlow-Based Production-Scale Machine Learning Platform http://dl.acm.org/authorize?N33328

Horovod: fast and easy distributed deep learning in TensorFlow https://arxiv.org/pdf/1802.05799v3.pdf

ML for Systems

Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms SOSP 2017 https://dl.acm.org/authorize?N47144

Adaptive Execution of Continuous and Data-intensive Workflows with Machine Learning Middleware 2018 https://dl.acm.org/citation.cfm?id=3274827

AuTO: Scaling Deep Reinforcement Learning to Enable Datacenter-Scale Automatic Traffic Optimization SIGCOMM 2018 https://dl.acm.org/citation.cfm?id=3230551

Neural Adaptive Video Streaming with Pensieve SIGCOMM 2017 https://dl.acm.org/citation.cfm?id=3098843

Neural Adaptive Content-aware Internet Video Delivery OSDI 2018 https://www.usenix.org/system/files/osdi18-yeo.pdf

Workshops

Systems for ML and Open Source Software Workshop at NeurIPS 2018 http://learningsys.org/nips18/acceptedpapers.html

SysML 2018 http://www.sysml.cc/2018/index.html

Engineering Dependable and Secure Machine Learning Systems 2019 https://sites.google.com/view/edsmls2019/program

Engineering Dependable and Secure Machine Learning Systems 2018 https://sites.google.com/edu.haifa.ac.il/edsmls/program

Workshop on Distributed Machine Learning 2017 https://distributedml2017.wordpress.com/schedule/

ML Systems Workshop at NIPS 2016 https://sites.google.com/site/mlsysnips2016/accepted-papers

Upcoming 2019

ColumnML: Column Store Machine Learning with On The Fly Data Transformation VLDB 2019

Continuous Integration of Machine Learning Models: A Rigorous Yet Practical Treatment SysML 2019

Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices ASPLOS 2019

RLgraph: Flexible Computation Graphs for Deep Reinforcement Learning SysML 2019 https://arxiv.org/pdf/1810.09028.pdf

For adding/updating the list

Fork the repository
Update this file
Send a pull request
