# Big Data Hackathon: Deploy a Spark Standalone cluster on one machine using Vagrant

## STEP 1: Make the base setup

- Install [Vagrant](https://www.vagrantup.com/) and a provider such as [VirtualBox](https://www.virtualbox.org/) on your host machine.

## STEP 2: Get and configure the environment

- Open a terminal, create a folder for your Vagrant project, and navigate to it:

```bash
mkdir myvagrant
cd myvagrant
```

- Create a file called `Vagrantfile` and put the following inside it:
```ruby
Vagrant.configure("2") do |config|
  config.vm.provision "shell", inline: "echo Hello there"
  # config.ssh.insert_key = false

  # Master box: runs the Spark Master (cluster UI on 8080, application UI on 4040)
  config.vm.define "master" do |master|
    master.vm.box = "ubuntu/xenial64"
    master.vm.network "public_network", ip: "192.168.0.10"
    master.vm.network "forwarded_port", guest: 4040, host: 4040
    master.vm.network "forwarded_port", guest: 8080, host: 8080
    master.vm.hostname = "ubuntu1"
  end

  # Slave box: runs a Spark Worker (worker UI on 8081)
  config.vm.define "slave" do |slave|
    slave.vm.box = "ubuntu/xenial64"
    slave.vm.network "public_network", ip: "192.168.0.11"
    slave.vm.network "forwarded_port", guest: 8081, host: 8081
    slave.vm.hostname = "ubuntu2"
  end
end
```
- Windows users:
  - Uncomment the third line: `# config.ssh.insert_key = false`
  - Do not use `sudo` in any of the command lines of this step
- Then run `sudo vagrant up` and wait a few minutes.
  - If you are asked which network interface to use, select the one you are currently connected to. For example, on a wired Ethernet connection common names are `eth0` or `em0`. To be sure, run `ifconfig` and pick the interface showing your current IP address.
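Once `vagrant up` finishes, you can check that both boxes came up (a quick sanity check, not part of the original steps):

```bash
# Both "master" and "slave" should report the state "running"
sudo vagrant status
```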

## STEP 3: Connect (SSH) to the Master and Slave boxes

Once STEP 2 is done successfully, we obtain two Ubuntu 16.04 boxes (guest virtual machines) connected to each other over a (public) network. One will be used as the Apache Spark Master, the other as the Slave. We also exposed ports 4040, 8080 and 8081 to the host machine (the one running Vagrant). We use those ports to open the web interfaces of the Master and Slave.

- Now SSH to the master using `sudo vagrant ssh master`, then open another terminal and SSH to the slave using `sudo vagrant ssh slave`. You are now inside an Ubuntu system.
- In both boxes, refresh the package index so the packages needed in the next steps can be found: `sudo apt-get update`
- If the command hangs with the message `[Connecting to archive.ubuntu.com (2001:67c:1360:8c01::1a)]`, fix it by disabling IPv6 using the steps here: https://askubuntu.com/questions/440649/how-to-disable-ipv6-in-ubuntu-14-04 (see the sketch after this list).
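A minimal sketch of the IPv6 workaround, assuming the sysctl approach from the linked answer (the linked page lists alternatives as well):

```bash
# Temporarily disable IPv6 so apt-get falls back to IPv4
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
# Then retry:
sudo apt-get update
```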

## STEP 4: Install Java (in both boxes)

- Run the following 2 lines (update the package index first, then install):

```bash
sudo apt-get update
sudo apt-get install openjdk-8-jre
```
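You can verify the installation afterwards:

```bash
# Should print an OpenJDK 1.8.x version string
java -version
```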

## STEP 5: Download and configure Spark (in both boxes)

- We will install version 2.1.0, so run:

```bash
sudo wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
sudo tar -xzvf spark-2.1.0-bin-hadoop2.7.tgz
cd spark-2.1.0-bin-hadoop2.7
```

- Navigate to the `conf` folder and create the Spark configuration file from its template:

```bash
cd conf
sudo cp spark-env.sh.template spark-env.sh
```

- Open `spark-env.sh` for editing (or use the one-liner shown after this list) and add the following line, which binds the Master to the IP we assigned in the Vagrantfile:

```bash
export SPARK_MASTER_HOST=192.168.0.10
```
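If you prefer not to open an editor, the same line can be appended like this (equivalent to editing by hand; `sudo` is needed because the file was created as root):

```bash
echo 'export SPARK_MASTER_HOST=192.168.0.10' | sudo tee -a spark-env.sh
```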

## STEP 6: Start Spark

- In the Master box, navigate to the `sbin` folder and execute the `start-master.sh` script:

```bash
cd ../sbin
sudo ./start-master.sh
```

- This prints the path of a log file; open it to obtain the master URL. You should find `spark://192.168.0.10:7077` (see the sketch after this list).
- In the Slave box, also navigate to the `sbin` folder and execute the `start-slave.sh` script, passing the master URL as an argument:

```bash
cd ../sbin
sudo ./start-slave.sh spark://192.168.0.10:7077
```
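One way to pull the master URL out of the log (the exact file name varies with user and hostname, hence the wildcard):

```bash
# Spark standalone logs live in ../logs, next to sbin
sudo grep "spark://" ../logs/spark-*-org.apache.spark.deploy.master.Master-*.out
```

If everything went well, the slave should now appear under "Workers" in the Master web UI, reachable from the host at http://localhost:8080 thanks to the forwarded port; the worker's own UI is at http://localhost:8081.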

## STEP 7: Open Spark Shell

- Navigate to the `bin` folder and run the `spark-shell` script, passing the master URL as an argument:

```bash
cd ../bin
sudo ./spark-shell --master spark://192.168.0.10:7077
```
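To confirm the shell is really talking to the cluster, you can run a small job from the Scala prompt (a sanity check, not part of the original steps):

```scala
// Distribute the numbers 1..100 across the cluster and sum them
sc.parallelize(1 to 100).sum()  // should return 5050.0
```

While the shell is running, its application UI is also forwarded to the host at http://localhost:4040.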

## Helpful tips

Have a question about the above? No panic, shoot me an email at: [email protected]