Hopsworks - Full-stack platform for scale-out data science

farah-nisar/hopsworks

 
 

Hopsworks

Join the chat at https://gitter.im/hopshadoop/hopsworks Google Group

Overview

Hopsworks is a platform for both the design and operation of data analytics and machine learning applications. You can design ML applications in Jupyter notebooks in Python and operate them in workflows orchestrated by Airflow, while running on HopsFS, the world's most scalable HDFS-compatible distributed hierarchical filesystem (peer-reviewed, 1.2 million ops/sec on Spotify's Hadoop workload). HopsFS also solves the small-files problem of HDFS by storing small files on NVMe disks in the horizontally scalable metadata layer. Hopsworks is also a platform for data engineering, with support for Spark, Flink, and Kafka. As an on-premises platform, Hopsworks has unique support for project-based multi-tenancy, horizontally scalable ML pipelines, and managed GPUs-as-a-resource.

Multi-tenancy - Projects, Users, Datasets

Hopsworks provides Projects as a privacy-by-design sandbox for data, including sensitive data, and for managing collaborating teams, much like GitHub. Datasets can be shared between projects, much like Dropbox. Each project has its own Anaconda environment, so data scientists can manage Python dependencies themselves.

HopsML

HopsML is our framework for writing end-to-end machine learning workflows in Python. We support Airflow to orchestrate workflows with: ETL in PySpark or TensorFlow, a Feature Store, AutoML hyperparameter optimization techniques over many hosts and GPUs in Keras/TensorFlow/PyTorch, in addition to distributed training such as Collective AllReduce.

Jupyter notebooks can be used to write all parts of the pipeline, and TensorBoard to visualize experiment results during and after training. Models can be deployed in Kubernetes (built-in or external) and monitored in production using Kafka/Spark-Streaming. For more information see the docs.

Feature Store

The feature store is a central place to store curated features for machine learning pipelines in Hopsworks. A feature is a measurable property of a data sample. It could be, for example, an image pixel, a word from a piece of text, the age of a person, a coordinate emitted from a sensor, or an aggregate value like the average number of purchases within the last hour. Features can come directly from tables or files, or can be derived values computed from one or more data sources. For more information see the docs.

TLS security

Uniquely in Hadoop, Hops supports X.509 certificates for authentication and authorization of users, services, and jobs, as well as TLS for in-flight encryption. At-rest encryption is also supported using ZFS-on-Linux.

HopsFS

HopsFS is a drop-in replacement for HDFS that adds distributed metadata and "small-files in metadata (NVMe disks)" support to HDFS.

Information

Installing Hopsworks

Installation of Hopsworks and all its services is automated with the Karamel software. Instructions on how to install the entire platform are available here.

For a local single-node installation, to access Hopsworks just point your browser at:

  http://localhost:8080/hopsworks
  username: [email protected]
  password: admin

Admin email may differ on your installation. Please refer to your Karamel cluster definition to access/set the email.

Build instructions

Hopsworks consists of the backend module which is packaged in two files, hopsworks.ear and hopsworks-ca.war, and the front-end module which is packaged in a single .war file.

Build Requirements (for Ubuntu)

NodeJS server and bower, both required for building the front-end.

sudo apt install nodejs-legacy
sudo apt-get install npm
sudo npm cache clean
# You must have a version of bower > 1.54
sudo npm install bower -g
sudo npm install grunt -g

Build with Maven

mvn install

Maven uses yeoman-maven-plugin to build both the front-end and the backend. Maven first executes the Gruntfile in the yo directory, then builds the back-end in Java. The yeoman-maven-plugin copies the dist folder produced by grunt from the yo directory to the target folder of the backend.

You can also build Hopsworks without the frontend (for Java EE development and testing):

mvn install -P-web

Front-end Development

The JavaScript produced by the Maven build is obfuscated. For debugging JavaScript, we recommend that you use the following script to deploy changes to HTML or JavaScript to your Vagrant machine:

cd scripts
./js.sh

You should also add the Chef recipe to the end of your Vagrantfile (or Karamel cluster definition):

 hopsworks::dev

For development

You can build Hopsworks without running grunt/bower using:

mvn install -P-dist

Then run your script to upload your JavaScript to snurran.sics.se:

cd scripts
./deploy.sh [yourName]

Testing Guide

The following steps must be taken to run Hopsworks integration tests:

Warning: These tests will clean HDFS and drop the Hopsworks database, so they should only be run on a test machine.

First create a .env file by copying the .env.example file. Then edit the .env file by providing your specific configuration.

   cd hopsworks/hopsworks-IT/src/test/ruby/
   cp .env.example .env

Then export environment variables to match the server you are deploying to:

   GLASSFISH_HOST_NAME=localhost
   GLASSFISH_HTTP_PORT=8181
   GLASSFISH_ADMIN_PORT=4848
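
As a minimal sketch, assuming a Bash-compatible shell, the settings above can be exported so that child processes such as mvn and rspec inherit them. The values simply mirror the example and should be changed to match your own server:

```shell
# Export the GlassFish connection settings for the current shell session.
# Values are the example ones from this README; adjust them for your server.
export GLASSFISH_HOST_NAME=localhost
export GLASSFISH_HTTP_PORT=8181
export GLASSFISH_ADMIN_PORT=4848
```

Variables exported this way only last for the current shell session, so they need to be set again in each new terminal.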

Change the server login credentials in hopsworks-IT/pom.xml:

  <properties>
    ...
    <glassfish.admin>{username}</glassfish.admin>
    <glassfish.passwd>{password}</glassfish.passwd>
    ...
  </properties>

Export the environment variables for the Selenium integration tests:

   HOPSWORKS_URL=http://localhost:8181/hopsworks
   HEADLESS=[true|false]
   BROWSER=[chrome|firefox]
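
The Selenium settings can be exported the same way. The sketch below assumes a Bash-compatible shell and assumes that HOPSWORKS_URL points at the same host and HTTP port as the GlassFish settings, falling back to localhost and 8181 when those are unset. The choice of a headless Chrome run is only an illustration:

```shell
# Build the Hopsworks URL from the GlassFish host and port if they are set,
# otherwise fall back to the example values used in this README.
export HOPSWORKS_URL="http://${GLASSFISH_HOST_NAME:-localhost}:${GLASSFISH_HTTP_PORT:-8181}/hopsworks"

# Illustrative choices: run the browser without a visible window, using Chrome.
export HEADLESS=true
export BROWSER=chrome
```

Set HEADLESS=false to watch the browser while the tests run, or BROWSER=firefox to test against Firefox instead.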

To compile, deploy, and run the integration tests:

   cd hopsworks/
   mvn clean install -Pjruby-tests

If you have already deployed hopsworks-ear and just want to run the integration tests:

   cd hopsworks/hopsworks-IT/src/test/ruby/
   bundle install
   rspec --format html --out ../target/test-report.html

To run a single test:

   cd hopsworks/hopsworks-IT/src/test/ruby/
   rspec ./spec/session_spec.rb:60

To skip tests that need to run inside a VM:

   cd hopsworks/hopsworks-IT/src/test/ruby/
   rspec --format html --out ../target/test-report.html --tag ~vm:true

When the tests finish, the test report opens in a browser if LAUNCH_BROWSER is set to true in .env.
