Spark machine learning inventory

A curated inventory of machine learning methods available on the Apache Spark platform, in both official and third-party libraries.

Table of Contents

Project inventory

Machine learning & related libraries

Bundled with Spark

  • GraphX - Apache Spark's API for graphs and graph-parallel computation
  • MLlib - Apache Spark's built-in machine learning library

Third-party libraries

  • Aerosolve - A machine learning package built for humans
  • AMIDST - probabilistic machine learning
  • BigDL - BigDL: Distributed Deep Learning Library for Apache Spark
  • CoCoA - communication-efficient distributed coordinate ascent
  • Deeplearning4j - Deeplearning4j on Spark
  • DissolveStruct - Distributed Solver for Structured Prediction
  • DistML - DistML provides a supplement to MLlib to support model parallelism on Spark
  • Elephas - Distributed Deep learning with Keras & Spark
  • Generalized K-means clustering - generalizes the Spark MLlib Batch and Streaming K-Means clusterers in every practical way
  • KeystoneML - a software framework, written in Scala, from the UC Berkeley AMPLab, designed to simplify the construction of large-scale, end-to-end machine learning pipelines with Apache Spark
  • MLbase - a platform for implementing and consuming machine learning at scale
  • ml-matrix - distributed matrix library
  • revrand - A library of scalable Bayesian generalised linear models with fancy features
  • spark-ts - Time series for Spark
  • Sparkling Water - H2O + Apache Spark
  • Splash - a general framework for parallelizing stochastic learning algorithms on multi-node clusters
  • Spectral LDA on Spark - implements a spectral (third-order tensor decomposition) method for learning LDA topic models on Spark
  • StreamDM - Data Mining for Spark Streaming
  • Thunder - scalable image and time series analysis
  • Zen - aims to provide the largest scale and the most efficient machine learning platform on top of Spark, including but not limited to logistic regression, latent Dirichlet allocation, factorization machines and DNN

Interfaces

Notebooks

  • Apache Zeppelin - A web-based notebook that enables interactive data analytics
  • Beaker - The data scientist's laboratory
  • Spark Notebook - Interactive and Reactive Data Science using Scala and Spark
  • sparknotebook - running Apache Spark using Scala in an IPython notebook

Visualization

Others

Task inventory

  • MLlib - Apache Spark's built-in machine learning library

Ensemble learning & parallel modelling

Libraries

  • DistML - DistML provides a supplement to MLlib to support model parallelism on Spark
  • Elephas - Distributed Deep learning with Keras & Spark
  • spark-FM-parallelISGD - Implementation of Factorization Machines on Spark using parallel stochastic gradient descent
  • SparkBoost - A distributed implementation of AdaBoost.MH and MP-Boost using Apache Spark
  • StreamDM - Data Mining for Spark Streaming

Algorithms

Classification

Libraries

  • MLlib - Apache Spark's built-in machine learning library
  • DissolveStruct - Distributed Solver for Structured Prediction
  • Spark kNN graphs - Spark algorithms for building k-nn graphs
  • Spark-libFM - implementation of Factorization Machines
  • Sparkling Ferns - Implementation of Random Ferns for Apache Spark
  • StreamDM - Data Mining for Spark Streaming

Algorithms

Clustering

Libraries

  • MLlib - Apache Spark's built-in machine learning library
  • Bisecting K-means - implementation of bisecting k-means clustering, a kind of hierarchical clustering algorithm
  • Generalized K-means clustering - generalizes the Spark MLlib Batch and Streaming K-Means clusterers in every practical way
  • Patchwork - Highly Scalable Grid-Density Clustering Algorithm for Spark MLlib
  • spark-tsne - Distributed t-SNE via Apache Spark
  • StreamDM - Data Mining for Spark Streaming

Algorithms

Data Transformation, Feature Selection & Dimensionality Reduction

Libraries

Algorithms

Deep Learning

Libraries

  • BigDL - BigDL: Distributed Deep Learning Library for Apache Spark
  • CaffeOnSpark - CaffeOnSpark brings deep learning to Hadoop and Spark clusters
  • Deeplearning4j - Deeplearning4j on Spark
  • DeepSpark - A neural network library which uses Spark RDD instances
  • Elephas - Distributed Deep learning with Keras & Spark
  • Sparkling Water - H2O + Apache Spark
  • TensorFrames - TensorFlow wrapper for DataFrames on Apache Spark

Graph computations

Libraries

  • GraphX - Apache Spark's API for graphs and graph-parallel computation

  • Spark kNN graphs - Spark algorithms for building k-nn graphs

  • SparklingGraph - large scale, distributed graph processing made easy

Itemset mining, frequent pattern mining & association rules

Linear algebra

Libraries

  • lazy-linalg - A package full of linear algebra operators for Apache Spark MLlib's linalg package
  • ml-matrix - distributed matrix library

Algorithms

  • Singular Value Decomposition (SVD): MLlib
  • Principal Component Analysis (PCA): MLlib

Matrix factorization & recommender systems

Libraries

  • MLlib - Apache Spark's built-in machine learning library

  • spark-FM-parallelISGD - Implementation of Factorization Machines on Spark using parallel stochastic gradient descent

  • Spark-libFM - implementation of Factorization Machines

  • Streaming Matrix Factorization - Distributed Streaming Matrix Factorization implemented on Spark for Recommendation Systems

Algorithms

Natural language processing

Libraries

Algorithms

Optimization & hyperparameter search

Libraries

  • MLlib - Apache Spark's built-in machine learning library

  • Elephas - Distributed Deep learning with Keras & Spark

  • Spark-TFOCS - port of TFOCS: Templates for First-Order Conic Solvers (cvxr.com/tfocs)

Algorithms

  • Alternating Least Squares (ALS): MLlib
  • First-Order Conic solvers: Spark-TFOCS
  • Gradient descent: MLlib
  • Grid Search: MLlib
  • Iteratively Reweighted Least Squares (IRLS): MLlib
  • Limited-memory BFGS (L-BFGS): MLlib
  • Normal equation solver: MLlib
  • Stochastic gradient descent (SGD): MLlib
  • Tree of Parzen estimators (TPE, via hyperopt): Elephas

Regression

Libraries

  • MLlib - Apache Spark's built-in machine learning library
  • revrand - A library of scalable Bayesian generalised linear models with fancy features
  • StreamDM - Data Mining for Spark Streaming

Algorithms

  • Bayesian generalised linear models: revrand
  • Decision tree regression: MLlib
  • Generalized linear regression: MLlib
  • Gradient-boosted tree regression: MLlib
  • Isotonic regression: MLlib
  • Linear regression: MLlib, StreamDM
  • Linear least squares: MLlib
  • Random forest regression: MLlib
  • Ridge regression: MLlib
  • Survival regression: MLlib
  • Support Vector Machine (SVM): MLlib

Statistics

  • Hypothesis testing: MLlib
  • Kernel density estimation: MLlib

Tensor decompositions

Libraries

  • Spectral LDA on Spark - implements a spectral (third-order tensor decomposition) method for learning LDA topic models on Spark

Algorithms

Time series

Libraries

  • spark-ts - Time series for Spark
  • Thunder - scalable image and time series analysis

Algorithms

Practical info

License

CC0

Contributing

Please read the Contribution Guidelines before submitting your suggestion.

To add content, feel free to open an issue or create a pull request.

Acknowledgments

This inventory is inspired by mfornos’ inventory of awesome microservices.

Table of contents generated with DocToc.