A curated list of awesome open-source Online Analytical Processing databases, frameworks, ressources, tools and other awesomeness, for data engineers.
- OLAP Databases
- Data lake
- Brokers and distributed messaging
- Ingestion and querying
- Scheduler
- ETL, ELT and reverse ETL
- Datasets
- Benchmark
- Readings
- People to follow
- Events
- Communities
The following columnar databases use a shared-nothing architecture and provide a sub-second response time. DDL, DML and DCL are operated via SQL. These databases also support tiering for long-term cold storage.
- Elasticsearch - Search and analytics engine based on Apache Lucene.
- Meilisearch - Open source search engine, that aims to be a ready-to-go solution.
- OpenSearch - Apache 2.0 fork of Elasticsearch.
- Quickwit - Search engine on top of object storage, using shared-everything architecture.
- Typesense - Πpen-source, typo-tolerant search engine optimized for instant search-as-you-type experiences and developer productivity.
- Citus - PostgreSQL compatible distributed table.
- TiDB - MySQL compatible SQL database that supports hybrid transactional and analytical processing workloads.
- Grafana Mimir - Prometheus compatible TSDB on top of object storage.
- TimeScaleDB - PostgreSQL compatible TSDB.
The data lake approach (or "lakehouse") is a semi-structured schema that sits on top of object storage in the cloud.
It is composed of a few layers (from lower to higher level): codec, file format, table format + metastore, and the ingestion/query layer.
These formats are popular for shared-everything databases, using object storage as a persistence layer. The data is organized in row or column, with strict schema definition. These files are immutable and offer partial reads (only headers, metadata, data page, etc). Mutation requires a new upload. Most formats support nested schema, codecs, compression, and data encryption. Index can be added to file metadata for faster processing.
A single file can weight between tens of MB to a few GB. Lots of small files require more merge operation. Larger files can be costly to update.
- Apache Arrow Columnar Format - Columnar format for in-memory Apache Arrow processing.
- Apache Avro - Row-oriented serialization for data streaming purpose.
- Apache ORC - Column-oriented serialization for data storage purpose. Part of Hadoop platform.
- Apache Parquet - Column-oriented serialization for data storage purpose.
- Apache Thrift - Row-oriented serialization for RPC purpose.
- Google Protobuf - Row-oriented serialization for RPC purpose.
- Capβn Proto - Row-oriented serialization with zero-copy access, as fast as mmap.
- Flatbuffer - Row-oriented serialization with zero-copy access, as fast as mmap.
- Schema Registry - Centralized repository for validating row-oriented events. Part of Kafka and Confluent platform.
Open table formats are abstraction layer on top of Avro/Parquet files, with support for ACID transaction, CDC, partitioning, mixed streaming/batching processing, schema evolution and mutation. Schema and statistics are stored in a metastore, data is persisted locally or in a remote/cloud object storage.
Open tables are a cost-effective datawarehouse for petabyte scale.
Comparison:
- (2022) https://medium.com/geekculture/open-table-formats-delta-iceberg-hudi-732f682ec0bb
- (2023) https://aws.amazon.com/blogs/big-data/choosing-an-open-table-format-for-your-transactional-data-lake-on-aws/
π Warning: pre-2022 articles should be considered as out-of-date, as open table formats are evolving quickly.
- AWS Glue
- Databricks unity catalog
- Hive Metastore - Component of Hadoop HiveServer2, that can be used standalone.
- Nessie
- Apache HDFS - Hadoop distributed file system.
- AWS S3
- Azure Blob Storage
- GCP Cloud Storage
- Minio - S3 compatible and self-hosted object storage.
Process a set of data in real-time (or near-real-time), as it is being generated.
- Apache Beam - Unified SDK for cross language stream processing. Available in Go, Python, Java, Scala and Typescript.
- Apache Flink - Stateful stream processing.
- Apache Kafka stream - Stream processing.
- Apache Spark streaming - Stream processing on top of Spark.
- Akka stream - Stream processing.
- Benthos - Go stream processing.
Process periodically a large amount of data in a single batch.
Non real-time SQL queries executed against a large database can be processed locally. This method might not fit into memory or lead to very long job duration.
- Apache Arrow - Low-level in-memory data processing. Zero-copy data manipulation for any language, via gRPC/IPC interfaces.
- Apache Arrow Datafusion - High level SQL interface for Apache Arrow.
- delta-rs - Standalone DeltaLake driver for Python and Rust. Do not depend on Spark.
- Delta Standalone - Standalone DeltaLake driver for Java and Scala. Do not depend on Spark.
- DuckDB - In-process SQL query engine for processing Parquet files. Built on top of Apache Arrow.
- Pandas - Python data analysis and manipulation tool.
- clickhouse-local - Lightweight CLI version of Clickhouse for running SQL queries against CSV, JSON, Parquet, etc files.
These SQL engines distribute SQL queries processing of very large database on a cluster. Support of ANSI SQL.
- Apache Spark SQL - Distributed SQL query engine that sit on top of Spark.
- ksql - SQL interface for Kafka.
- PrestoDB - Distributed SQL query engine.
- Trino - Distributed SQL query engine. Fork of PrestoDB.
These tools allow to orchestrate, schedule and monitor repetitive data transformations, in a workflow manner.
The popular acronym for Extracting, Transforming and Loading data. ELT performs data transformations directly within the data warehouse. Reverse ETL is the process of copying data from your datawarehouse to external tools or SaaS.
- Airbyte - ELT.
- Census - Reverse ETL.
- RudderStack - Customer Data Platform. Pipeline between a tracking plan, event transformation, and destination tools (datawarehouse or SaaS).
- awesome-public-datasets
- CommonCrawl
- Criteo
- Entso-e
- GitHub Archives
- Kaggle - Community sourced dataset.
- NYCTaxy
- Jepsen - Distributed databases, queues and consensus protocols testing.
- TPC family benchmarks - For big data based database.
- Partitioning
- Data skipping
- Statistics
- High cardinality
- HyperLogLog
- Bloom filters
- Minmax
- Z-ordering
- Bitmap index
- Dense index
- Sparse index
- Reverse index
- N-gram
- TF-IDF
- LSM Tree
- Apache Arrow vectorized execution
- Apache Arrow SIMD parallel processing
- Cockroach vectorized JOIN
- Latency comparison numbers
- Engineering at Meta
- Engineering at Criteo
- Engineering at Uber
- Engineering at Airbnb
- Databricks
- Towards Data Science
- https://www.moderndatastack.xyz/
- https://books.japila.pl/
- https://jepsen.io/analyses
- https://github.com/aphyr/distsys-class
// TODO
// TODO
// TODO
Contributions of any kind welcome, just follow the guidelines!
Give a βοΈ if this project helped you!
Copyright Β© 2023 Samuel Berthe.
This project is MIT licensed.