Hopsworks is a platform for both the design and operation of data analytics and machine learning applications. You can design ML applications in Jupyter notebooks in Python and operate them in workflows orchestrated by Airflow, while running on HopsFS, the world's most scalable HDFS-compatible distributed hierarchical filesystem (peer-reviewed, 1.2M ops/sec on Spotify's Hadoop workload). HopsFS also solves the small-files problem of HDFS by storing small files on NVMe disks in the horizontally scalable metadata layer. Hopsworks is also a platform for data engineering, with support for Spark, Flink, and Kafka. As an on-premises platform, Hopsworks has unique support for project-based multi-tenancy, horizontally scalable ML pipelines, and managed GPUs-as-a-resource.
Hopsworks provides Projects as a privacy-by-design sandbox for data, including sensitive data, and for managing collaborating teams - like GitHub. Datasets can be shared between projects - like Dropbox. Each project has its own Anaconda environment, so data scientists can manage Python dependencies themselves.
HopsML is our framework for writing end-to-end machine learning workflows in Python. Airflow orchestrates workflows that can include ETL in PySpark or TensorFlow, a Feature Store, AutoML hyperparameter optimization across many hosts and GPUs in Keras/TensorFlow/PyTorch, and distributed training such as Collective AllReduce.
Jupyter notebooks can be used to write all parts of the pipeline, and TensorBoard to visualize experiment results during and after training. Models can be deployed in Kubernetes (built-in or external) and monitored in production using Kafka/Spark-Streaming. For more information see the docs.
The feature store is a central place to store curated features for machine learning pipelines in Hopsworks. A feature is a measurable property of a data sample. It could be, for example, an image pixel, a word from a piece of text, the age of a person, a coordinate emitted from a sensor, or an aggregate value like the average number of purchases within the last hour. Features can come directly from tables or files, or can be derived values computed from one or more data sources. For more information see the docs.
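To make the idea of a derived feature concrete, here is a minimal sketch in plain Python. It does not use the Hopsworks feature store API; the event data, the customer names, and the `purchases_in_last_hour` helper are all hypothetical and only illustrate how an aggregate feature can be computed from raw records.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical raw data: (customer_id, purchase timestamp) events.
events = [
    ("alice", datetime(2020, 1, 1, 11, 15)),
    ("alice", datetime(2020, 1, 1, 11, 50)),
    ("bob", datetime(2020, 1, 1, 10, 30)),
    ("bob", datetime(2020, 1, 1, 11, 40)),
]


def purchases_in_last_hour(events, now):
    """Derived feature: per-customer purchase count in the hour ending at `now`."""
    window_start = now - timedelta(hours=1)
    counts = Counter()
    for customer, timestamp in events:
        if window_start < timestamp <= now:
            counts[customer] += 1
    return dict(counts)


now = datetime(2020, 1, 1, 12, 0)
feature = purchases_in_last_hour(events, now)

# Aggregate over the customers that purchased in the window (0.0 if none did).
average = sum(feature.values()) / len(feature) if feature else 0.0
```

Here `feature` holds one value per customer, and `average` is a single aggregate value of the kind that could be stored as a curated feature.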
Uniquely in Hadoop, Hops uses X.509 certificates to authenticate and authorize users, services, and jobs, and TLS for in-flight encryption. At-rest encryption is also supported using ZFS-on-Linux.
HopsFS is a drop-in replacement for HDFS that adds distributed metadata and "small-files in metadata (NVMe disks)" support to HDFS.
Installation of Hopsworks and all its services is automated with the Karamel software. Instructions on how to install the entire platform are available here.
For a local single-node installation, to access Hopsworks just point your browser at:
http://localhost:8080/hopsworks
username: [email protected]
password: admin
Admin email may differ on your installation. Please refer to your Karamel cluster definition to access/set the email.
Hopsworks consists of the backend module, which is packaged in two files, hopsworks.ear and hopsworks-ca.war, and the front-end module, which is packaged in a single .war file.
Building the front-end requires both a NodeJS server and bower:
sudo apt install nodejs-legacy
sudo apt-get install npm
sudo npm cache clean
# You must have a version of bower > 1.54
sudo npm install bower -g
sudo npm install grunt -g
mvn install
Maven uses yeoman-maven-plugin to build both the front-end and the backend. Maven first executes the Gruntfile in the yo directory, then builds the back-end in Java. The yeoman-maven-plugin copies the dist folder produced by grunt from the yo directory to the target folder of the backend.
You can also build Hopsworks without the frontend (for Java EE development and testing):
mvn install -P-web
The JavaScript produced by the Maven build is obfuscated. For debugging JavaScript, we recommend that you use the following script to deploy changes to HTML or JavaScript to your Vagrant machine:
cd scripts
./js.sh
You should also add the chef recipe to the end of your Vagrantfile (or Karamel cluster definition):
hopsworks::dev
You can build Hopsworks without running grunt/bower using:
mvn install -P-dist
Then run your script to upload your javascript to snurran.sics.se:
cd scripts
./deploy.sh [yourName]
The following steps must be taken to run Hopsworks integration tests:
Warning: the tests clean HDFS and drop the Hopsworks database, so they should only be run on a test machine.
First create a .env file by copying the .env.example file. Then edit the .env file by providing your specific configuration.
cd hopsworks/hopsworks-IT/src/test/ruby/
cp .env.example .env
Then export environment variables to match the server you are deploying to:
GLASSFISH_HOST_NAME=localhost
GLASSFISH_HTTP_PORT=8181
GLASSFISH_ADMIN_PORT=4848
Change the server login credentials in hopsworks-IT/pom.xml:
<properties>
...
<glassfish.admin>{username}</glassfish.admin>
<glassfish.passwd>{password}</glassfish.passwd>
...
</properties>
Export environment variables for the Selenium integration test:
HOPSWORKS_URL=http://localhost:8181/hopsworks
HEADLESS=[true|false]
BROWSER=[chrome|firefox]
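Before launching the tests, it can help to confirm that the variables above are set. The following is a minimal sketch of such a pre-flight check, not something shipped with Hopsworks: the `check_test_environment` helper is hypothetical, and it only assumes the variable names and allowed values listed in this README.

```python
import os

# Variable names and allowed values are taken from this README; the helper
# itself is a hypothetical illustration and not part of Hopsworks.
REQUIRED = (
    "GLASSFISH_HOST_NAME",
    "GLASSFISH_HTTP_PORT",
    "GLASSFISH_ADMIN_PORT",
    "HOPSWORKS_URL",
    "HEADLESS",
    "BROWSER",
)
ALLOWED = {"HEADLESS": {"true", "false"}, "BROWSER": {"chrome", "firefox"}}
PORTS = ("GLASSFISH_HTTP_PORT", "GLASSFISH_ADMIN_PORT")


def check_test_environment(env=None):
    """Return a list of problems found; an empty list means every check passed."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED:
        value = env.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif name in ALLOWED and value not in ALLOWED[name]:
            problems.append(f"{name} must be one of {sorted(ALLOWED[name])}, got {value!r}")
        elif name in PORTS and not value.isdigit():
            problems.append(f"{name} must be a port number, got {value!r}")
    return problems
```

Calling `check_test_environment()` with no arguments inspects the current process environment; passing a dictionary makes it easy to try the check against example values first.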
To compile, deploy and run the integration test:
cd hopsworks/
mvn clean install -Pjruby-tests
If you have already deployed hopsworks-ear and just want to run the integration test:
cd hopsworks/hopsworks-IT/src/test/ruby/
bundle install
rspec --format html --out ../target/test-report.html
To run a single test:
cd hopsworks/hopsworks-IT/src/test/ruby/
rspec ./spec/session_spec.rb:60
To skip tests that need to run inside a VM:
cd hopsworks/hopsworks-IT/src/test/ruby/
rspec --format html --out ../target/test-report.html --tag ~vm:true
When the tests are done, the test report is opened in a browser if LAUNCH_BROWSER is set to true in .env.