This repository is a fork of Apache Spark that natively supports using HashiCorp's Nomad as Spark's cluster manager (as an alternative to Hadoop YARN and Mesos). When running on Nomad, the Spark executors that run tasks for your Spark application, and optionally the application driver itself, run as Nomad tasks in a Nomad job.
Sample spark-submit command when using Nomad:
spark-submit \
--class org.apache.spark.examples.JavaSparkPi \
--master nomad \
--deploy-mode cluster \
--conf spark.executor.instances=4 \
--conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/nomad-spark/spark-2.1.0-bin-nomad.tgz \
https://s3.amazonaws.com/nomad-spark/spark-examples_2.11-2.1.0-SNAPSHOT.jar 100
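The command above can be parameterized when the executor count or the example's argument needs to vary between runs. The following is a minimal sketch rather than anything shipped with this fork: the function name `nomad_spark_cmd` is hypothetical, and the helper only prints the command instead of running it, so it can be inspected first. The flags, configuration keys and URLs are the ones from the sample command.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository): prints a spark-submit
# command line for the Nomad master. Nothing is submitted; the command is
# only echoed so it can be reviewed or piped to `sh`.
# Usage: nomad_spark_cmd <executor-count> <app-argument>
nomad_spark_cmd() {
  executors="$1"   # value for spark.executor.instances
  app_arg="$2"     # argument passed to the example application

  # Reject an empty or non-numeric executor count.
  case "$executors" in
    ''|*[!0-9]*) echo "executor count must be a number" >&2; return 1 ;;
  esac

  # URLs copied from the sample command above.
  dist="https://s3.amazonaws.com/nomad-spark/spark-2.1.0-bin-nomad.tgz"
  jar="https://s3.amazonaws.com/nomad-spark/spark-examples_2.11-2.1.0-SNAPSHOT.jar"

  echo "spark-submit" \
    "--class org.apache.spark.examples.JavaSparkPi" \
    "--master nomad" \
    "--deploy-mode cluster" \
    "--conf spark.executor.instances=$executors" \
    "--conf spark.nomad.sparkDistribution=$dist" \
    "$jar $app_arg"
}

# Reproduces the sample command as a single line.
nomad_spark_cmd 4 100
```

Printing rather than executing keeps the sketch safe to run on a machine without a Nomad cluster or a Spark distribution on the path.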
The ultimate goal is to integrate Nomad into Spark directly, either natively or via a backend/scheduler plugin interface.
Nomad's design is heavily inspired by Google's work on Borg and Omega, which has led to a set of features that make Nomad well suited to running analytical applications. Particularly relevant are its native support for batch workloads and its parallelized, high-throughput scheduling (Nomad's documentation covers the scheduler internals in more detail).
Nomad is easy to set up and use. It consists of a single binary/process, has a simple and intuitive data model, uses a declarative job specification, and supports high availability and multi-datacenter federation out of the box. Nomad also integrates seamlessly with HashiCorp's other runtime tools: Consul and Vault.
To get started, see Nomad's official Apache Spark Integration Guide. You can also use Nomad's example Terraform configuration and embedded Spark quickstart to give the integration a test drive on AWS. Builds are currently available for Spark 2.1.0 and 2.1.1.