- A Scala tool that helps you deploy an Apache Spark standalone cluster and submit jobs to it.
- Currently supports Amazon EC2 with Spark 1.4.1+.
- There are two ways to use spark-deployer: as an SBT plugin or in embedded mode.
- Set the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` for AWS.
- Prepare a project with a structure like the one below:
```
project-root
├── build.sbt
├── project
│   └── plugins.sbt
├── spark-deployer.conf
└── src
    └── main
        └── scala
            └── mypackage
                └── Main.scala
```
- Write one line in `project/plugins.sbt`:

```scala
addSbtPlugin("net.pishen" % "spark-deployer-sbt" % "1.3.0")
```
- Write your cluster configuration in `spark-deployer.conf` (see the example below). If you want to use another configuration file name, please set the environment variable `SPARK_DEPLOYER_CONF` when starting sbt (e.g. `$ SPARK_DEPLOYER_CONF=./my-spark-deployer.conf sbt`).
- Write your Spark project's `build.sbt` (here we give a simple example):
```scala
lazy val root = (project in file("."))
  .settings(
    name := "my-project-name",
    version := "0.1",
    scalaVersion := "2.10.6",
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
    )
  )
```
- Write your job's algorithm in `src/main/scala/mypackage/Main.scala`:

```scala
package mypackage

import org.apache.spark._

object Main {
  def main(args: Array[String]): Unit = {
    // set up Spark
    val sc = new SparkContext(new SparkConf())

    // your algorithm
    val n = 10000000
    val count = sc.parallelize(1 to n).map { i =>
      val x = scala.math.random
      val y = scala.math.random
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
  }
}
```
- Create the cluster with `sbt "sparkCreateCluster <number-of-workers>"`. You can also execute `sbt` first and type `sparkCreateCluster <number-of-workers>` in the sbt console. You may first type `spark` and hit TAB to see all the available commands.
- Once the cluster is created, submit your job with `sparkSubmitJob <arg0> <arg1> ...`
- When your job is done, destroy your cluster with `sparkDestroyCluster`.
Other available commands:

- `sparkCreateMaster` creates only the master node.
- `sparkAddWorkers <number-of-workers>` dynamically adds more workers to an existing cluster.
- `sparkCreateCluster <number-of-workers>` is a shortcut for the above two commands.
- `sparkRemoveWorkers <number-of-workers>` dynamically removes workers to scale down the cluster.
- `sparkDestroyCluster` terminates all the nodes in the cluster.
- `sparkRestartCluster` restarts the cluster with new settings from `spark-env` without recreating the machines.
- `sparkShowMachines` shows the machine addresses, along with commands to log in to the master and run `spark-shell` on it.
- `sparkUploadJar` uploads the job's jar file to the master node.
- `sparkSubmitJob` uploads and runs the job.
- `sparkRemoveS3Dir <dir-name>` removes the S3 directory together with its `_$folder$` marker file (e.g. `sparkRemoveS3Dir s3://bucket_name/middle_folder/target_folder`).
If you don't want to use sbt, or if you would like to trigger the cluster creation from within your Scala application, you can include the spark-deployer library directly:

```scala
libraryDependencies += "net.pishen" %% "spark-deployer-core" % "1.3.0"
```
Then, from your Scala code, you can do something like this:
```scala
import java.io.File
import sparkdeployer._

val sparkDeployer = new SparkDeployer(ClusterConf.fromFile("path/to/spark-deployer.conf"))

val numOfWorkers = 2
val jobJar = new File("path/to/job.jar")
val args = Seq("arg0", "arg1")

sparkDeployer.createCluster(numOfWorkers)
sparkDeployer.submitJob(jobJar, args)
sparkDeployer.destroyCluster()
```
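If you want the cluster to be torn down even when the job fails, one option is to wrap the calls yourself. Below is a minimal sketch that uses only the API shown above; it plays a similar role to the `destroy-on-fail` setting described later, but under your own control:

```scala
import java.io.File
import sparkdeployer._

object DeployAndRun {
  def main(cliArgs: Array[String]): Unit = {
    val sparkDeployer = new SparkDeployer(ClusterConf.fromFile("path/to/spark-deployer.conf"))
    sparkDeployer.createCluster(2)
    try {
      // upload the jar and run the job
      sparkDeployer.submitJob(new File("path/to/job.jar"), Seq("arg0", "arg1"))
    } finally {
      // terminate all the nodes even if submitJob throws
      sparkDeployer.destroyCluster()
    }
  }
}
```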
- The environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` should be set in this mode, too.
- You may prepare the `job.jar` with sbt-assembly from another sbt project that uses Spark (see the sketch after this list).
- For other available functions, check `SparkDeployer.scala` in our source code.
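For reference, a minimal sbt-assembly setup for building such a fat jar could look like the following (the plugin version here is only an example; pick the one matching your sbt version):

```scala
// project/plugins.sbt of the job's own project
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
```

Keep `spark-core` marked as `"provided"` in that project's `build.sbt` (as in the example above) so Spark itself is not bundled, then run `sbt assembly` and pass the resulting jar to `submitJob`.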
spark-deployer uses slf4j; remember to add your own logging backend to see the logs. For example, to print the logs to the console, add:

```scala
libraryDependencies += "org.slf4j" % "slf4j-simple" % "1.7.14"
```
Here we give an example of `spark-deployer.conf` (settings commented out with `#` are optional):
```
cluster-name = "pishen-spark"
keypair = "pishen"
pem = "/home/pishen/.ssh/pishen.pem"
region = "us-west-2"
# ami = "ami-acd63bcc"
# user = "ubuntu"
# root-device = "/dev/sda1"
master {
  instance-type = "c4.large"
  disk-size = 8
  driver-memory = "2G"
}
worker {
  instance-type = "c4.xlarge"
  disk-size = 40
  executor-memory = "6G"
}
# retry-attempts = 20
spark-tgz-url = "http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.4.tgz"
main-class = "mypackage.Main"
# app-name = "my-app-name"
# security-group-ids = ["sg-xxxxxxxx", "sg-yyyyyyyy"]
# subnet-id = "subnet-xxxxxxxx"
# use-private-ip = true
# spark-env = [
#   "SPARK_WORKER_CORES=3",
#   "SPARK_WORKER_MEMORY=6G"
# ]
# destroy-on-fail = true
# thread-pool-size = 100
# enable-s3a = true
```
- You can provide your own `ami`; the image should be HVM EBS-backed with Java 7+ installed.
- You can provide your own `user`, which will be the username used to log in to the AWS machines.
- You can provide your own `root-device`, which will be your root volume's name; this volume can be enlarged by `disk-size` in the `master` and `worker` settings.
- Currently tested `instance-type`s are `t2.medium`, `m3.medium`, and `c4.xlarge`. All the M3, M4, C3, and C4 types should work; please report an issue if you encounter a problem.
- The value of `disk-size` is in GB and should be at least 8. It resets the size of the root partition, which is used by both the OS and Spark.
- `driver-memory` and `executor-memory` are the memory available to Spark; you may subtract 2G from the physical memory of the corresponding machine.
- Some steps of the deployment (e.g. SSH login) may fail on the first try. By default, spark-deployer retries 10 times before throwing an exception. You can change the number of retries with `retry-attempts`.
- `spark-tgz-url` specifies the location of the Spark tarball that each machine downloads.
  - You may find a tarball at Spark Downloads.
  - To install a different version of Spark, just replace the tarball with the corresponding version.
  - The URL can also be an S3 path starting with `s3://` (if you use a custom ami, please make sure `awscli` is installed on it).
  - The URL must end with `/<spark-folder-name>.tgz` for the auto deployment to work.
- `security-group-ids` specifies a list of security groups to apply to the machines.
  - Since Akka uses random ports to connect with the master, the security groups should allow all traffic between machines in the cluster.
  - Allow port 22 for SSH login.
  - Allow ports 8080, 8081, and 4040 for the web console (optional).
  - Please check the Spark security page for more information about port settings.
- `spark-env` adds additional Spark settings to `conf/spark-env.sh` on each node. Note that `SPARK_MASTER_IP`, `SPARK_MASTER_PORT`, and `SPARK_PUBLIC_DNS` are hard-coded for now.
- `destroy-on-fail`: if set to `true`, the cluster is destroyed when spark-deployer hits an error in `sparkCreateCluster` or `sparkSubmitJob`. Note that you still need to destroy the cluster yourself if no error happens.
- `enable-s3a`: if set to `true`, adds support for s3a (requires Hadoop 2.0+). We use the workaround as described here.
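As a quick illustration of the last option: with `enable-s3a` set to `true`, your job should be able to read `s3a://` paths directly (the bucket and path below are placeholders):

```scala
// inside the job, after the SparkContext has been created
val lines = sc.textFile("s3a://your-bucket/path/to/input.txt")
println("line count: " + lines.count())
```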