Sgx-Spark

This is Apache Spark with modifications to run security-sensitive code inside Intel SGX enclaves. The implementation leverages sgx-lkl, a library OS that makes it possible to run Java-based applications inside SGX enclaves.

Docker quick start

This guide shows how to run Sgx-Spark in a few simple steps using Docker. Most of the setup and deployment is wrapped within Docker containers, so compilation and deployment should be smooth.

Preparing the Sgx-Spark Docker environment

  • Clone this Sgx-Spark repository

  • Build the Sgx-Spark base image. The name of the resulting Docker image is sgxspark. This process might take a while (30-60 minutes):

      sgx-spark/dockerfiles$ docker build -t sgxspark .
    
  • Prepare the disk image required by sgx-lkl. Due to Docker restrictions, this step currently cannot be performed as part of the Docker build above and is therefore platform-dependent. The process has been tested successfully on Ubuntu 16.04 and Arch Linux:

      sgx-spark/lkl$ make prepare-image
    
  • Create a Docker network device that will be used for communication by the Docker containers. Note that for user-defined networks, Docker provides an embedded DNS server, so workers can find the Spark master by name.

      sgx-spark$ docker network create sgxsparknet
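
    To confirm that the user-defined network and its embedded DNS are in place, it can be inspected with standard Docker commands:

      sgx-spark$ docker network ls | grep sgxsparknet
      sgx-spark$ docker network inspect sgxsparknet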
    

Running Sgx-Spark jobs using Docker

From within the directory sgx-spark/dockerfiles, run the Sgx-Spark master node, the Sgx-Spark worker node, and the actual Sgx-Spark job as follows. A quick way to check that the worker has registered with the master is sketched after this list.

  • Run the Sgx-Spark master node:

      sgx-spark/dockerfiles$ docker run \
      --user user \
      --env-file $(pwd)/docker-env \
      --net sgxsparknet \
      --name sgxspark-docker-master \
      -p 7077:7077 \
      -p 8082:8082 \
      -ti sgxspark /sgx-spark/master.sh
    
  • Run the Sgx-Spark worker node:

      sgx-spark/dockerfiles$ docker run \
      --user user \
      --memory="4g" \
      --shm-size="8g" \
      --env-file $(pwd)/docker-env \
      --net sgxsparknet \
      --privileged \
      -v $(pwd)/../lkl:/spark-image:ro \
      -ti sgxspark /sgx-spark/worker-and-enclave.sh
    
  • Run the Sgx-Spark job as follows.

    As of writing, the three jobs below are known to be fully supported:

    • WordCount

        sgx-spark/dockerfiles$ docker run \
        --user user \
        --memory="4g" \
        --shm-size="8g" \
        --env-file $(pwd)/docker-env \
        --net sgxsparknet \
        --privileged \
        -v $(pwd)/../lkl:/spark-image:ro \
        -e SPARK_JOB_CLASS=org.apache.spark.examples.MyWordCount \
        -e SPARK_JOB_NAME=WordCount \
        -e SPARK_JOB_ARG0=README.md \
        -e SPARK_JOB_ARG1=output \
        -ti sgxspark /sgx-spark/driver-and-enclave.sh
      
    • KMeans

        sgx-spark/dockerfiles$ docker run \
        --user user \
        --memory="4g" \
        --shm-size="8g" \
        --env-file $(pwd)/docker-env \
        --net sgxsparknet \
        --privileged \
        -v $(pwd)/../lkl:/spark-image:ro \
        -e SPARK_JOB_CLASS=org.apache.spark.examples.mllib.KMeansExample \
        -e SPARK_JOB_NAME=KMeans \
        -e SPARK_JOB_ARG0=data/mllib/kmeans_data.txt \
        -ti sgxspark /sgx-spark/driver-and-enclave.sh
      
    • LineCount

        sgx-spark/dockerfiles$ docker run \
        --user user \
        --memory="4g" \
        --shm-size="8g" \
        --env-file $(pwd)/docker-env \
        --net sgxsparknet \
        --privileged \
        -v $(pwd)/../lkl:/spark-image:ro \
        -e SPARK_JOB_CLASS=org.apache.spark.examples.LineCount \
        -e SPARK_JOB_NAME=LineCount \
        -e SPARK_JOB_ARG0=SgxREADME.md \
        -ti sgxspark /sgx-spark/driver-and-enclave.sh
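
Once the master and worker containers are up, the master's web UI can be used to check that the worker has registered. The master container above maps port 8082, which is assumed here to serve the Spark master web UI. A minimal reachability check from the Docker host (assuming curl is installed); alternatively, open http://localhost:8082 in a browser:

$ curl -sf http://localhost:8082/ > /dev/null && echo "Spark master web UI reachable"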
      

Native compilation, installation and deployment

To run Sgx-Spark natively, proceed as follows.

Install package dependencies

Install all required dependencies. For Ubuntu 16.04, these can be installed as follows:

$ sudo apt-get update
$ sudo apt-get install -y --no-install-recommends scala libtool autoconf curl xutils-dev git build-essential maven openjdk-8-jdk ssh bc python autogen wget autotools-dev sudo automake

Compile and install Google Protocol Buffer 2.5.0

Hadoop, and thus Spark, depends on Google Protocol Buffers (GPB) version 2.5.0:

  • Make sure to uninstall any other versions of GPB (a removal and verification sketch follows this list)

  • Install GPB v2.5.0. Instructions for Ubuntu 16.04 are as follows:

      $ cd /tmp
      /tmp$ wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
      /tmp$ tar xvf protobuf-2.5.0.tar.gz
      /tmp$ cd protobuf-2.5.0
      /tmp/protobuf-2.5.0$ ./autogen.sh && ./configure && make && sudo make install
      /tmp/protobuf-2.5.0$ sudo apt-get install -y --no-install-recommends libprotoc-dev
    

    Instructions for Arch Linux are available at https://stackoverflow.com/a/29799354/2273470.
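
A hedged sketch of the uninstall and verification steps on Ubuntu 16.04 (the package names below are the distribution's own protobuf packages; adjust for other distributions):

$ sudo apt-get remove -y protobuf-compiler libprotobuf-dev   # run before building v2.5.0 from source
$ sudo ldconfig        # refresh the linker cache after "make install"
$ protoc --version     # expected output: libprotoc 2.5.0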

Compile sgx-lkl

As Sgx-Spark uses sgx-lkl, the latter must be downloaded and compiled first. As of writing (June 14, 2018), sgx-lkl should be compiled from the cleanup-musl branch. Please follow the sgx-lkl documentation and ensure that your sgx-lkl installation runs simple Java applications successfully.

Compile Sgx-Spark

  • Compile Sgx-Spark:

      sgx-spark$ build/mvn -DskipTests package

  • As part of this compilation process, a modified Hadoop library is also compiled. Copy the Hadoop JAR file into the Sgx-Spark jars directory (a quick check follows this list):

      sgx-spark$ cp hadoop-2.6.5-src/hadoop-common-project/hadoop-common/target/hadoop-common-2.6.5.jar assembly/target/scala-2.11/jars/
    
  • Sgx-Spark ships with a native C library (libringbuff.so) that enables shared-memory-based communication between two JVMs. Compile and install it as follows:

      sgx-spark/C$ make install
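
As a quick check that both artifacts are in place, list the modified Hadoop JAR (path taken from the step above) and search for the native library; the install location of libringbuff.so is not fixed here, so the find is only a best-effort check:

sgx-spark$ ls -l assembly/target/scala-2.11/jars/hadoop-common-2.6.5.jar
sgx-spark$ find . -name 'libringbuff.so'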
    

Prepare the Sgx-Spark disk images that will be run using sgx-lkl

  • Adjust the file sgx-spark/lkl/Makefile for your environment:

    The variable SGX_LKL must point to your sgx-lkl directory (see Compile sgx-lkl above). A sketch for setting it on the make command line instead follows this list.

  • Build the Sgx-Spark disk image required for sgx-lkl:

      sgx-spark/lkl$ make clean all
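
Assuming the Makefile uses standard GNU make variable handling, SGX_LKL can also be set on the command line instead of editing the file; /path/to/sgx-lkl below is a placeholder for your sgx-lkl checkout:

sgx-spark/lkl$ make SGX_LKL=/path/to/sgx-lkl clean all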
    

Run Sgx-Spark using sgx-lkl

Finally, we are ready to run (i) the Sgx-Spark master node, (ii) the Sgx-Spark worker node, (iii) the worker's enclave, (iv) the Sgx-Spark client, and (v) the client's enclave. In the following commands, replace: <hostname> with the master node's actual hostname; <sgx-lkl> with the path to your sgx-lkl installation.

Note: After running each example, make sure to (i) restart all processes, (ii) delete all shared memory files in /dev/shm.
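
A minimal cleanup sketch; it assumes nothing else on the machine keeps state in /dev/shm, so review the listing before deleting anything:

$ ls /dev/shm/
$ rm -f /dev/shm/*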

  • If you run all the nodes locally, you need to add the following line to variables.sh:

      export SPARK_LOCAL_IP=127.0.0.1
    
  • Run the Master node

      sgx-spark$ ./master.sh
    
  • Run the Worker node

      sgx-spark$ ./worker.sh
    
  • Run the enclave for the Worker node

      sgx-spark$ ./worker-enclave.sh
    
  • Run the enclave for the driver program. This is the process that will output the job results!

      sgx-spark$ ./driver-enclave.sh
    
  • Finally, submit a Spark job. The result will be output by the driver-enclave process started in the previous step.

    • WordCount

        sgx-spark$ ./submitwordcount.sh
      
    • KMeans

        sgx-spark$ ./submitkmeans.sh
      
    • LineCount

        sgx-spark$ ./submitlinecount.sh
      

Native execution of the same Spark installation

To run the above installation without SGX, start your environment as follows:

  • Start the Master node as above

  • Start the Worker node as above, but change the environment variable SGX_ENABLED from true to false

  • Do not start the enclaves

  • Submit the Spark job as above, but likewise change the environment variable SGX_ENABLED from true to false (a sketch follows this list)
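
For example, assuming the scripts pick up an SGX_ENABLED value exported in the environment (otherwise change the value where it is defined, e.g. in variables.sh):

sgx-spark$ SGX_ENABLED=false ./worker.sh
sgx-spark$ SGX_ENABLED=false ./submitwordcount.sh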

Important developer notes

Code changes and recompilation

There are a few important things to keep in mind when developing Sgx-Spark:

  • Whenever you change parts of the code, you must recompile Spark:

      sgx-spark$ mvn package -DskipTests
    

    In some situations, the above command does not recompile all of the changed files. In that case, issue:

      sgx-spark$ mvn clean package -DskipTests
    
  • After changing the Sgx-Spark code and recompiling the Java/Scala code (see above), you always need to rebuild the lkl image used by sgx-lkl:

      sgx-spark/lkl$ make clean all
    
  • If you changed parts of the Hadoop code (in the directory hadoop-2.6.5-src), you also need to copy the resulting JAR file:

      sgx-spark$ cp hadoop-2.6.5-src/hadoop-common-project/hadoop-common/target/hadoop-common-2.6.5.jar assembly/target/scala-2.11/jars/
    
  • Lastly, do not forget to remove all related shared memory files in /dev/shm/ before running your next experiment! A combined rebuild-and-clean sketch follows this list.
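
The rebuild steps above can be chained into a single sequence; a hedged sketch run from the Sgx-Spark root directory, with all paths taken from this section:

sgx-spark$ build/mvn -DskipTests package        # recompile the Spark/Scala code
sgx-spark$ cp hadoop-2.6.5-src/hadoop-common-project/hadoop-common/target/hadoop-common-2.6.5.jar assembly/target/scala-2.11/jars/
sgx-spark$ make -C lkl clean all                # rebuild the lkl disk image
sgx-spark$ ls /dev/shm/                         # review before deleting
sgx-spark$ rm -f /dev/shm/*                     # assumes nothing else uses /dev/shm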

Running without sgx-lkl

Development with sgx-lkl can be tedious. For development purposes, a special flag allows the enclave side of Sgx-Spark to run in a regular JVM rather than on top of sgx-lkl. To use this feature, run the enclave JVMs using the scripts worker-enclave-nosgx.sh and driver-enclave-nosgx.sh.

Under the hood, these scripts set environment variable DEBUG_IS_ENCLAVE_REAL=false (defaults to true) and provide the JVM with a value for environment variable SGXLKL_SHMEM_FILE. Note that the value of SGXLKL_SHMEM_FILE must be the same as the one provided for the corresponding Worker (worker.sh) and Driver (driver.sh).
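
For illustration, a hedged sketch: the shared memory file name below is a hypothetical placeholder, and this assumes a value exported in the environment takes precedence; otherwise adjust the value inside the scripts so that the Worker and its enclave-side JVM use the same file:

sgx-spark$ SGXLKL_SHMEM_FILE=sgx-spark-worker-shmem ./worker.sh
sgx-spark$ SGXLKL_SHMEM_FILE=sgx-spark-worker-shmem ./worker-enclave-nosgx.sh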