Modern data stack

Minimal example integrating docker images of the following Big Data open-source projects:

  • Apache Spark with the Apache Iceberg runtime (spark-iceberg image, which also ships a Jupyter/Python kernel)
  • Apache Hive Metastore (HMS)
  • Trino
  • MinIO

Some technologies were also tested but finally discarded (have a look at the compatibility issues section below):

  • Iceberg REST catalog v.16: not compatible with Trino when the warehouse is NOT in AWS S3

Since the open-source big data ecosystem is vibrant, this modern-data-stack is always evolving. Currently, only the above projects are integrated, but in the near future other complementary and promising projects will be considered, such as:

  • OpenMetadata (data catalog)
  • Apache Ranger (data security)

Installation

Start all docker containers with:

docker-compose up -d
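
Once the containers are up, a quick way to verify that every service started correctly is to list their state:

docker-compose ps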

Initializing datalake

Since this repo is for teaching purposes, a .env file is provided. The provided access keys are totally disposable and not used in any real system. For an easy setup, it is recommended to keep those access keys unaltered, to avoid changing the configuration files that reference them.

To provision the access keys and create the bucket in the MinIO server, just type:

docker-compose exec minio bash /opt/bin/init_datalake.sh
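
The exact script is part of this repo, but as a rough sketch it boils down to a few MinIO client (mc) commands: register the local server, create the disposable access keys from the .env file and create the warehouse bucket. The alias and variable names below are assumptions for illustration only:

# register the local MinIO server under a throwaway alias (alias and variable names are assumptions)
mc alias set local http://localhost:9000 "$MINIO_ROOT_USER" "$MINIO_ROOT_PASSWORD"
# create the disposable access keys defined in the .env file
mc admin user add local "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY"
# create the bucket backing the warehouse (s3a://warehouse/...)
mc mb --ignore-existing local/warehouse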

Testing installation

Since the whole Spark code base runs on the JVM, using spark-shell (instead of pyspark) is recommended for troubleshooting installation errors. This avoids the confusing, Python-wrapped errors coming from JVM components.

To test the installation, just start a Scala-based Spark shell:

docker-compose exec spark spark-shell 

Paste this code into the shell to test the connection between Spark, the Iceberg runtime and the Hive metastore:

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

spark.sql("CREATE DATABASE IF NOT EXISTS nyc300 LOCATION 's3a://warehouse/nyc300';")

val schema = StructType( Array(
    StructField("vendor_id", LongType,true),
    StructField("trip_id", LongType,true),
    StructField("trip_distance", FloatType,true),
    StructField("fare_amount", DoubleType,true),
    StructField("store_and_fwd_flag", StringType,true)
))

val df = spark.createDataFrame(spark.sparkContext.emptyRDD[Row],schema)
df.writeTo("hms.nyc300.test").create()
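
If the block runs without errors, the table should now be visible through the hms catalog. As an optional sanity check, it can be listed and read back from the same shell (the result set will simply be empty):

spark.sql("SHOW TABLES IN hms.nyc300").show()
spark.sql("SELECT * FROM hms.nyc300.test").show()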

No errors should be raised. For debugging purposes, log files are always a good place to look. They are stored in the spark docker container and can be checked like this:

docker-compose exec spark bash
ls -al /home/iceberg/spark-events/*

Using Jupyter notebooks

Once the installation is properly set up, using Jupyter notebooks is much more convenient than CLI tools. Since a Python kernel is distributed with the spark-iceberg docker image, all coding examples are developed in Python and SQL. No Scala is used, to keep the code easy to read.

Open the notebook called Testing Iceberg tables. This notebook is heavily inspired by the excellent Iceberg quickstart guide.

The Spark hms catalog is configured in this file and passed to the spark container as the spark-defaults.conf file. This file sets Iceberg as the default table format for this catalog and HMS as the catalog's metastore.
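
For reference, the relevant part of such a spark-defaults.conf usually looks something like the lines below. Treat this as a sketch rather than a verbatim copy of the repo's file: the catalog keys are standard Iceberg/Spark properties, but the hostnames are assumptions based on the docker-compose setup:

# Iceberg catalog named "hms", backed by the Hive Metastore (hostnames are assumptions)
spark.sql.catalog.hms                org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hms.type           hive
spark.sql.catalog.hms.uri            thrift://hive-metastore:9083
spark.sql.catalog.hms.warehouse      s3a://warehouse/
spark.sql.catalog.hms.io-impl        org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.hms.s3.endpoint    http://minio:9000
spark.sql.defaultCatalog             hms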

If everything is properly set up, executing the first notebook cell will create a new namespace (a.k.a. database) called nyc100 in HMS. All tables created in this notebook, using both the Python Spark API and SQL, use the Iceberg table format underneath due to the spark-defaults defined here.

Using Trino

The trino client is installed in the trino container. Just connect to it:

docker-compose exec trino trino

Using Trino with the Iceberg connector sets the default table format to Iceberg. A Trino catalog called hms, using the Iceberg connector and pointing to HMS, is defined in this file. Iceberg tables created from Spark in HMS can be used seamlessly from Trino and vice versa.
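
Roughly, that catalog file (mounted as etc/catalog/hms.properties in the trino container) contains properties along the following lines. The property names are the standard Iceberg connector settings; the hostnames and key placeholders are assumptions, not the repo's actual values:

connector.name=iceberg
iceberg.catalog.type=hive_metastore
hive.metastore.uri=thrift://hive-metastore:9083
hive.s3.endpoint=http://minio:9000
hive.s3.path-style-access=true
hive.s3.aws-access-key=<access key from .env>
hive.s3.aws-secret-key=<secret key from .env>

The SQL below, run from the trino CLI, creates a schema and a table through that catalog; both end up as Iceberg objects registered in HMS: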

CREATE SCHEMA IF NOT EXISTS hms.nyc200 WITH (location = 's3a://warehouse/nyc200');

CREATE TABLE IF NOT EXISTS hms.nyc200.sales_from_trino (
  productcategoryname VARCHAR,
  productsubcategoryname VARCHAR,
  productname VARCHAR,
  customerName VARCHAR,
  salesTerritoryCountry VARCHAR,
  salesOrderNumber VARCHAR,
  orderQuantity INTEGER
);

SELECT * FROM hms.nyc200.sales_from_trino;
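
To see the interoperability in the other direction, the same table can be read back from the Scala Spark shell started earlier (it will simply be empty until some rows are inserted):

spark.sql("SELECT * FROM hms.nyc200.sales_from_trino").show()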

Compatibility issues

Apache Hive MetaStore

[2023/08/26] Since the Apache Hive team publishes linux/amd64 images on Docker Hub for both v3 and v4 (alpha and beta prereleases), I decided to test them:

  • Both the alpha v4 and beta v4 prereleases work perfectly well, but Trino v425 is only compatible with the Hive Metastore Thrift API v3.1. In real-life usage, this produces some incompatibility errors when using Trino that can be easily reproduced using the hive4 branch of this repo.

  • The official Hive 3.1.3 docker image does not start, showing database initialization errors. Consequently, I wrote my own Dockerfile for installing HMS 3.1.3.

Apache Iceberg REST catalog

[2023/09/03] Trino is compatible with several metastore catalogs apart from Hive Metastore, such as the REST Catalog and the JDBC Catalog. However, Trino does not fully support them, making it impossible to create views and materialized views with the REST Catalog and the JDBC Catalog. Consequently, this modern-data-stack is still based on Hive Metastore (HMS) until these limitations are overcome.

Additionally, since Trino does not allow redefining the S3 endpoint when using the REST catalog, Trino will always try to connect to the AWS S3 public cloud instead of the local MinIO. The Iceberg REST catalog is therefore not currently an option for this PoC.

Spark works perfectly well with the Iceberg REST catalog, and the spark-defaults needed are in this file. However, the selected metastore is HMS, mainly due to compatibility issues (especially with Trino).
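
For reference, pointing Spark at a REST catalog only takes a handful of properties, roughly like the following; the catalog name and endpoint here are assumptions for illustration, not the repo's exact values:

# Iceberg catalog backed by a REST catalog service (name and endpoint are assumptions)
spark.sql.catalog.rest               org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.rest.type          rest
spark.sql.catalog.rest.uri           http://iceberg-rest:8181
spark.sql.catalog.rest.warehouse     s3a://warehouse/
spark.sql.catalog.rest.io-impl       org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.rest.s3.endpoint   http://minio:9000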

Useful links
