-
Notifications
You must be signed in to change notification settings - Fork 1
Running Spark on Amazon EC2
This guide describes how to get Spark running on an EC2 cluster. It assumes you have already signed up for Amazon EC2 account on the Amazon Web Services site.
Spark now includes some EC2 Scripts for launching and managing clusters on EC2. You can typically launch a cluster in about five minutes. Follow the instructions at this link for details.
Older versions of Spark use the EC2 launch scripts included in Mesos. You can use them as follows:
- Download Mesos using the instructions on the Mesos wiki. There's no need to compile it.
- Launch a Mesos EC2 cluster following the EC2 guide on the Mesos wiki. (Essentially, this involves setting some environment variables and running a Python script.)
- Log into your EC2 cluster's master node using
mesos-ec2 -k <keypair> -i <key-file> login cluster-name
. - Go into the
spark
directory inroot
's home directory. - Run either
spark-shell
or another Spark program, setting the Mesos master to use tomaster@<ec2-master-node>:5050
. You can also find this master URL in the file~/mesos-ec2/cluster-url
in newer versions of Mesos. - Use the Mesos web UI at
http://<ec2-master-node>:8080
to view the status of your job.
The Spark EC2 machines may not come with the latest version of Spark. To use a newer version, you can run git pull
to pull in /root/spark
to pull in the latest version of Spark from git
, and build it using sbt/sbt compile
. You will also need to copy it to all the other nodes in the cluster using ~/mesos-ec2/copy-dir /root/spark
.
Spark's file interface allows it to process data in Amazon S3 using the same URI formats that are supported for Hadoop. You can specify a path in S3 as input through a URI of the form s3n://<id>:<secret>@<bucket>/path
, where <id>
is your Amazon access key ID and <secret>
is your Amazon secret access key. Note that you should escape any /
characters in the secret key as %2F
. Full instructions can be found on the Hadoop S3 page.
In addition to using a single input file, you can also use a directory of files as input by simply giving the path to the directory.