Minimal example integrating docker images of the following Big Data open-source projects:
- trino: v425 - http://localhost:8080
- MinIO: v2023.08.23 - http://localhost:9000
- HMS (Hive MetaStore): v3.1.3
- Apache Spark: v3.4.1
- Apache Iceberg: v1.6
- Jupyter notebooks: v1.0.0 - http://localhost:8000/tree
There are also some technologies that were tested but finally discarded (have a look at the incompatibilities section):
- Iceberg REST catalog: v.16 - not compatible with trino when the warehouse is NOT in AWS S3
Since the open-source big data ecosystem is vibrant, this modern-data-stack is always evolving. Currently, only the above projects are integrated, but in the near future other complementary and promising projects will be considered, such as:
- OpenMetadata (data catalog)
- Apache Ranger (data security)
Start all docker containers with:
docker-compose up -d
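For orientation only, a heavily trimmed sketch of what such a docker-compose.yml can look like is shown below. The image tags, service names and port mappings here are assumptions for illustration; the compose file in the repo is the source of truth:
services:
  trino:
    image: trinodb/trino:425          # web UI at http://localhost:8080
    ports:
      - "8080:8080"
  minio:
    image: minio/minio                # S3-compatible object storage at http://localhost:9000
    ports:
      - "9000:9000"
  hive-metastore:
    build: ./hms                      # custom HMS 3.1.3 image (see the Dockerfile note below)
    ports:
      - "9083:9083"
  spark:
    image: tabulario/spark-iceberg    # Spark + Iceberg runtime + Jupyter
    ports:
      - "8000:8888"                   # notebooks at http://localhost:8000/tree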
Since this repo is for teaching purposes, a .env file
is provided. The provided access keys are totally disposable and not used in any real system. For an easy setup, it's recommended to keep those access keys unaltered to avoid changing the following configuration files:
To provision the access keys and create the bucket in the MinIO server, just type:
docker-compose exec minio bash /opt/bin/init_datalake.sh
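Under the hood, a script like this typically just drives the MinIO client (mc). A minimal sketch of what it can do is shown below; the alias and the variable names are assumptions (the bucket name warehouse matches the s3a://warehouse paths used later), so refer to the actual script in the repo:
#!/bin/bash
# Sketch only. Register the local MinIO server under an alias...
mc alias set local http://localhost:9000 "${MINIO_ROOT_USER}" "${MINIO_ROOT_PASSWORD}"
# ...create the bucket that will hold the Iceberg warehouse...
mc mb --ignore-existing local/warehouse
# ...and provision the disposable access keys used by Spark and trino (hypothetical variable names).
mc admin user add local "${DATALAKE_ACCESS_KEY}" "${DATALAKE_SECRET_KEY}"
mc admin policy attach local readwrite --user "${DATALAKE_ACCESS_KEY}"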
Since the whole Spark code base runs on the JVM, using spark-shell (instead of pyspark) is recommended for troubleshooting installation errors. This avoids confusing, Python-wrapped errors coming from JVM components.
To test the installation, just start a Scala-based Spark shell:
docker-compose exec spark spark-shell
Just paste this code into the shell to test the connection between Spark, the Iceberg runtime and the Hive metastore:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

// Create a namespace (database) in the warehouse bucket
spark.sql("CREATE DATABASE IF NOT EXISTS nyc300 LOCATION 's3a://warehouse/nyc300'")

// Define the schema of a test table
val schema = StructType(Array(
  StructField("vendor_id", LongType, true),
  StructField("trip_id", LongType, true),
  StructField("trip_distance", FloatType, true),
  StructField("fare_amount", DoubleType, true),
  StructField("store_and_fwd_flag", StringType, true)
))

// Create an empty DataFrame with that schema and write it as an Iceberg table in the hms catalog
val df = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
df.writeTo("hms.nyc300.test").create()
No errors should be raised. For debugging purposes, log files are always a good place to look. Log files are stored in the spark docker container and can be checked like this:
docker-compose exec spark bash
ls -al /home/iceberg/spark-events/*
Once the installation is properly set up, using Jupyter notebooks is much more convenient than CLI tools. Since a Python kernel is distributed in the spark-iceberg docker image, all coding examples are developed in Python and SQL. No Scala is used, to keep the code easy to read.
Open the notebook called Testing Iceberg tables. This notebook is totally inspired by the excellent Iceberg quick start guide.
The Spark hms catalog
is configured in this file and passed to the spark container as the spark-defaults.conf file. This file sets Iceberg as the default table format for this catalog. It also sets HMS as the catalog's metastore.
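As a rough sketch, the relevant lines of that spark-defaults.conf look more or less like the following (the property values, in particular the host names, are assumptions; the actual file in the repo is the reference):
spark.sql.extensions                  org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.defaultCatalog              hms
spark.sql.catalog.hms                 org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hms.type            hive
spark.sql.catalog.hms.uri             thrift://hive-metastore:9083
spark.sql.catalog.hms.warehouse       s3a://warehouse
spark.sql.catalog.hms.io-impl         org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.hms.s3.endpoint     http://minio:9000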
If everything is properly set up, a new namespace (a.k.a. database) called nyc100
will be created in HMS when executing the first notebook cell. All tables created in this notebook, using both the Python Spark API and SQL, use the Iceberg table format underneath thanks to the spark-defaults defined here.
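In essence, the first cells do something along the following lines (a minimal sketch, not the literal notebook content; the taxis table and its columns are just illustrative):
# The hms catalog and the Iceberg defaults come from spark-defaults.conf,
# so plain SQL creates Iceberg tables without any extra options.
spark.sql("CREATE DATABASE IF NOT EXISTS nyc100 LOCATION 's3a://warehouse/nyc100'")

# SQL path: an Iceberg table in the hms catalog (table and columns are illustrative)
spark.sql("""
    CREATE TABLE IF NOT EXISTS hms.nyc100.taxis (
        vendor_id BIGINT,
        trip_distance DOUBLE
    )
""")

# Python DataFrame API path: same Iceberg format underneath
df = spark.createDataFrame([(1, 2.5), (2, 7.1)], ["vendor_id", "trip_distance"])
df.writeTo("hms.nyc100.taxis").append()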
The trino client is installed in the trino container. Just connect to it:
docker-compose exec trino trino
Using trino with the Iceberg connector sets the default table format to Iceberg. Creating a trino catalog
called hms, using the Iceberg connector and pointing to HMS, is done in this file. Iceberg tables created from Spark in HMS can be used seamlessly from trino and vice versa.
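Roughly, that catalog file (typically etc/catalog/hms.properties inside the trino container) contains properties along these lines; the host names and key placeholders below are assumptions, and the file in the repo is authoritative:
connector.name=iceberg
iceberg.catalog.type=hive_metastore
hive.metastore.uri=thrift://hive-metastore:9083
hive.s3.endpoint=http://minio:9000
hive.s3.path-style-access=true
hive.s3.aws-access-key=<the access key from .env>
hive.s3.aws-secret-key=<the secret key from .env>
For example, from the trino client, a schema and an Iceberg table can be created in that catalog: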
CREATE SCHEMA IF NOT EXISTS hms.nyc200 WITH (location = 's3a://warehouse/nyc200');
CREATE TABLE IF NOT EXISTS hms.nyc200.sales_from_trino (
    productcategoryname VARCHAR,
    productsubcategoryname VARCHAR,
    productname VARCHAR,
    customerName VARCHAR,
    salesTerritoryCountry VARCHAR,
    salesOrderNumber VARCHAR,
    orderQuantity INTEGER
);
SELECT * FROM hms.nyc200.sales_from_trino;
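As a quick check of that interoperability, the table created earlier from spark-shell (hms.nyc300.test) should also be queryable from the same trino session, assuming the spark-shell test above was run:
SHOW TABLES FROM hms.nyc300;
SELECT * FROM hms.nyc300.test;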
[2023/08/26] Since the Apache Hive team publishes linux/amd64 images on Docker Hub for both v3 and v4 (alpha and beta prereleases), I decided to test them:
- Both the alpha v4 and beta v4 prereleases work perfectly well, but Trino v425 is only compatible with the Hive Metastore Thrift API v3.1. In real-life usage, this produces some incompatibility errors when using trino that can be easily reproduced using the hive4 branch of this repo.
- The official Hive 3.1.3 docker image does not start, showing database initialization errors. Consequently, I wrote my own Dockerfile for installing HMS 3.1.3.
[2023/09/03] Trino is compatible with several metastore catalogs apart from Hive Metastore, such as the REST Catalog and the JDBC Catalog. However, trino does not have full support for them, making it impossible to create views and materialized views with the REST Catalog and the JDBC Catalog. Consequently, modern-data-stack is still based on Hive MetaStore (HMS) until these limitations are overcome.
Additionally, since trino does not allow redefining the S3 endpoint when using the REST catalog, trino will always try to connect to the AWS S3 public cloud and not to the local MinIO. The Iceberg REST catalog is therefore not an option for this PoC at the moment.
Spark works perfectly well with the Iceberg REST catalog, and the required spark-defaults are in this file. However, the selected metastore is HMS, mainly because of compatibility issues (especially with trino).
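For reference, a REST-catalog configuration for Spark looks roughly like the following sketch (the catalog name, host names and port are assumptions; the actual values are in the file mentioned above):
spark.sql.catalog.rest                org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.rest.type           rest
spark.sql.catalog.rest.uri            http://iceberg-rest:8181
spark.sql.catalog.rest.warehouse      s3://warehouse
spark.sql.catalog.rest.io-impl        org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.rest.s3.endpoint    http://minio:9000
spark.sql.catalog.rest.s3.path-style-access  true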
- https://iceberg.apache.org/spark-quickstart/
- https://tabular.io/blog/docker-spark-and-iceberg/
- https://blog.min.io/manage-iceberg-tables-with-spark/
- https://blog.min.io/iceberg-acid-transactions/
- https://tabular.io/blog/rest-catalog-docker/
- https://blog.min.io/lakehouse-architecture-iceberg-minio/
- https://tabular.io/blog/iceberg-fileio/?ref=blog.min.io
- https://tabular.io/guides-and-papers/
- https://www.datamesh-architecture.com/tech-stacks/minio-trino
- https://dev.to/alexmercedcoder/configuring-apache-spark-for-apache-iceberg-2d41
- https://iceberg.apache.org/docs/latest/configuration/#catalog-properties