
EDDN Spark Compose

This is a local Apache Spark cluster with an Apache Cassandra database that can be built quickly and easily using Docker Compose. The focus is on the integration of Elite Dangerous (EDDN) data, which is loaded directly into Cassandra. This makes it possible to run PySpark tests and analyses with Spark directly after initialization.

Table of Contents

Installation

Usage

References

Installation

To use the cluster, the following must be installed:

  1. Python 3
  2. Docker
  3. Docker Compose

Docker is needed for the individual components, each of which runs in its own container. Docker Compose starts all containers together, and Python 3 is used to load the Elite Dangerous data.

Python 3

It is recommended to install the latest version of Python 3.

Linux

Before installing the latest version, check whether a Python 3 version is already installed by running:

$ python3 --version

If a version of Python 3 is already installed, you can upgrade it to the latest version:

$ sudo apt-get upgrade python3 

If you want to install the latest version of Python 3, run:

$ sudo apt-get install python3 

After the installation, check whether pip (the Python package manager) was installed along with Python:

$ pip3 -V 

If pip isn't installed on your machine, run the following command to install it:

$ sudo apt install python3-pip

Windows

Download the executable installer from https://www.python.org/downloads/windows/ .
Afterwards run the installer and follow the instructions.
It is also possible to install Python with Anaconda or via PowerShell.

MacOS

Before installing Python, make sure that Xcode and Homebrew are installed on your computer.

If that is not the case, then run this in the terminal to install Xcode:

$ xcode-select --install

and this to install Homebrew:

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" 

Now check whether a Python version is already installed:

$ python3 --version 

If a version is already installed, you can upgrade it to the latest version:

$ brew update 
$ brew upgrade python3 

To install Python 3, run this command in your terminal:

$ brew install python3 

After the installation, check whether pip (the Python package manager) was installed along with Python:

$ pip3 -V 

If pip isn't installed on your machine, run the following command to install it:

$ curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py 
$ python3 get-pip.py

Docker

Linux

For detailed information, take a look at the Docker documentation, the first link in the References section.

Uninstall the old version

Make sure that no outdated Docker version is installed:

$ sudo apt-get remove docker docker-engine docker.io containerd runc 

Set up the Repository

Update the apt package index:

$ sudo apt-get update

Then install packages to allow apt to use a repository over HTTPS:

$ sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg-agent \
    software-properties-common

Now add Docker’s official GPG key:

$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

Afterwards, verify that the key with the fingerprint 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88 was added, searching by the last 8 characters of the fingerprint:

$ sudo apt-key fingerprint 0EBFCD88

Finally, use the following command to set up the stable repository:

$ sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"

Install the current version of Docker

Update the apt package index:

$ sudo apt-get update 

Install the latest version of Docker:

$ sudo apt-get install docker-ce docker-ce-cli containerd.io

Microsoft Windows

If you haven’t already downloaded the installer (Docker Desktop Installer.exe), you can get it from download.docker.com.

  1. Double-click Docker Desktop Installer.exe to run the installer.

  2. Follow the install wizard to accept the license, authorize the installer, and proceed with the installation.

  3. Click Finish on the setup dialog to complete and launch Docker.

MacOS

To install Docker Desktop for Mac, download Docker.dmg from Docker Hub.
You need to sign up on Docker Hub to download Docker for Mac.

  1. Double-click Docker.dmg to open the installer, then drag Moby the whale to the Applications folder.

  2. Double-click Docker.app in the Applications folder to start Docker.

Docker Compose

Linux

To install Docker Compose with curl, run this command to download the current stable release:

$ sudo curl -L "https://github.com/docker/compose/releases/download/1.24.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose

Apply executable permissions to the binary:

$ sudo chmod +x /usr/local/bin/docker-compose

Alternatively, you can use pip to install Docker Compose:

$ sudo pip install docker-compose

Test if the installation was successful:

$ docker-compose --version

Windows and MacOS

The desktop version of Docker already includes Docker Compose, so no separate installation is needed.

Usage

The execution is divided into individual shell scripts. Their functionality and benefits are explained in the Project sections chapter. There is also a shell script with which all steps can be executed in the correct order. The project is designed for Linux systems, but can be ported to other operating systems by adapting the shell scripts.

Project sections

The following describes all steps that need to be executed in the given order to make the project/cluster work.

Load Data from eddb.io

The data comes from the game Elite Dangerous (EDDN) and is provided by the API of the website eddb.io.
The Python script EDDNClient.py reads the data from the API and writes it in JSON format into a .log file (Logs_JSON_EDDN_yyyy-mm-dd).
The number of downloaded datasets/rows is defined by the argument -d, --datasets.
Afterwards, the .log file is transformed into CSV format with the script transform_to_csv.py to make it suitable for Cassandra.

For the execution, use the shell script download_and_transform_data.sh with the required argument:

$ bash download_and_transform_data.sh -d <Number of datasets that should be downloaded> 

or

$ bash download_and_transform_data.sh --datasets=<Number of datasets that should be downloaded>
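The JSON-to-CSV transformation step itself roughly amounts to flattening one JSON document per log line into a CSV row. The following is only a minimal sketch of that idea in Python; the file names and field names are hypothetical placeholders, and the real columns are defined by transform_to_csv.py:

import csv
import json

# Hypothetical file names and columns for illustration only; the actual
# schema is defined by transform_to_csv.py.
LOG_FILE = "Logs_JSON_EDDN_2020-01-01.log"
CSV_FILE = "eddn_data.csv"
FIELDS = ["timestamp", "system_name", "station_name"]

with open(LOG_FILE, encoding="utf8") as src, \
     open(CSV_FILE, "w", newline="", encoding="utf8") as dst:
    writer = csv.writer(dst)
    writer.writerow(FIELDS)                  # header row for the later CSV import
    for line in src:
        record = json.loads(line)            # one JSON document per line of the .log file
        writer.writerow([record.get(field, "") for field in FIELDS])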

Run Docker-Compose file

After the data has been prepared, the cluster can be created. To do this, use the shell script run_docker_compose.sh. The script expects an argument that specifies the number of Spark nodes/slaves. Docker Compose then builds the cluster with the scaled number of Spark nodes.

$ bash run_docker_compose.sh -n <number of nodes/workers that will be created> 

or

$ bash run_docker_compose.sh --nodes=<number of nodes/workers that will be created> 

Copy Data to Cassandra

After the cluster is initialized, the EDDN data can be loaded into the database. Execute the script:

$ bash load_data_into_cassandra.sh 

This script executes the CQL file copy_data.cql, which creates the keyspace and the table and loads the data from the CSV file.
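For orientation, the same steps (create a keyspace, create a table, load the CSV rows) could also be reproduced from Python with the DataStax Cassandra driver. This is only a sketch under assumptions: the keyspace, table and column names below are hypothetical placeholders and do not reflect the real schema in copy_data.cql.

from cassandra.cluster import Cluster
import csv

# Hypothetical keyspace/table/columns; the real schema lives in copy_data.cql.
cluster = Cluster(["127.0.0.1"])          # assumes Cassandra is reachable on localhost
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS eddn
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS eddn.messages (
        id uuid PRIMARY KEY,
        timestamp text,
        system_name text,
        station_name text
    )
""")

insert = session.prepare(
    "INSERT INTO eddn.messages (id, timestamp, system_name, station_name) "
    "VALUES (uuid(), ?, ?, ?)"
)
with open("eddn_data.csv", encoding="utf8") as f:
    for row in csv.DictReader(f):
        session.execute(insert, (row["timestamp"], row["system_name"], row["station_name"]))

cluster.shutdown()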

Run PySpark scripts

The provided PySpark script eddb_data.py is just an example. It will select the whole table and write the result into a CSV file in the folder ./compose_cluster/export_data.

Use this shell script to run PySpark:

$ bash exec_pyspark_scripts.sh 

In this script you can also add your own PySpark scripts, or replace the existing one, to execute them.

Note: A pandas function is used to create the CSV. However, this is only useful for small amounts of data, because it loads all the data into RAM before writing. As a result, RAM fills up if the amount of data is too large. To avoid this, you can comment out the following line in the PySpark script eddb_data.py:

# df_data.toPandas().to_csv('/tmp/check_cass_data.csv', header=True, encoding='utf8')
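If you still need a CSV export for larger data volumes, Spark's own distributed CSV writer can be used instead of pandas, since it writes each partition separately and never collects the whole table on the driver. The following is only a minimal sketch under assumptions: the connection host, keyspace, table name and output path are placeholders, and the spark-cassandra-connector is assumed to be available on the Spark classpath.

from pyspark.sql import SparkSession

# Placeholder host, keyspace, table and output path; the real values are
# defined by the compose setup and eddb_data.py.
spark = (SparkSession.builder
         .appName("eddn_export_sketch")
         .config("spark.cassandra.connection.host", "cassandra")
         .getOrCreate())

df_data = (spark.read
           .format("org.apache.spark.sql.cassandra")
           .options(keyspace="eddn", table="messages")
           .load())

# Each partition is written as its own part file, so the data never has to
# fit into the driver's RAM.
df_data.write.mode("overwrite").option("header", True).csv("/tmp/export_data")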

Run all in one

To avoid executing each step separately, a shell script with two arguments can be used:

$ bash run_all.sh -d <Number of datasets that should be downloaded> -n <number of nodes/workers that will be created>

or

$ bash run_all.sh --datasets=<Number of datasets that should be downloaded> --nodes=<number of nodes/workers that will be created>

This script executes all steps in the correct order.

Note: During the first execution of the script it is possible that the data is not copied or the PySpark script is not executed correctly. This happens because of the initialization time of the Docker containers. If this problem occurs, it can be fixed by executing the shell script again. Since the script recognizes that the cluster already exists, only the missing steps will be performed. However, the number of datasets to load must be specified every time the script is run, so it is recommended to set this to a low value when running it again. Once the cluster has been built, there will be no recurring complications when running the script again.

Remove and clean

To remove the cluster with all Docker containers and images, use the script:

$ bash remove_docker_container_images.sh 

To delete all created data files, run the script:

$ bash clean_folders_from_files.sh 

References

These are the references that were used to create this readme.