
Add IcebergDocument as one type of the operator result storage #3147

Merged
merged 73 commits into master from jiadong-add-file-result-storage
Jan 8, 2025

Conversation

bobbai00
Collaborator

@bobbai00 bobbai00 commented Dec 10, 2024

Implement Apache Iceberg for Result Storage


How to Enable Iceberg Result Storage

  1. Update storage-config.yaml:
    • Set result-storage-mode to iceberg (see the excerpt below).
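For reference, the setting looks roughly like this in storage-config.yaml; only this key is taken from the PR, and the surrounding structure of the file may differ:

```yaml
# storage-config.yaml (excerpt; surrounding keys omitted)
result-storage-mode: iceberg
```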

Major Changes

  • Introduced IcebergDocument: A thread-safe VirtualDocument implementation for storing and reading results in Iceberg tables.
  • Introduced IcebergTableWriter: Append-only writer for Iceberg tables with configurable buffer size.
  • Catalog and Data storage for Iceberg: Uses a local file system (file:/) via HadoopCatalog and HadoopFileIO. This ensures Iceberg operates without relying on external storage services.
  • ProgressiveSinkOpExec now takes a new workerId parameter; each writer created by the result storage receives this workerId.

Dependencies

  • Added Apache Iceberg-related libraries.
  • Introduced Hadoop-related libraries to support Iceberg's HadoopCatalog and HadoopFileIO. These libraries are needed only for configuration placeholders and do not add a runtime dependency on HDFS.

Overview of Iceberg Components

IcebergDocument

  • Manages reading and organizing data in Iceberg tables.
  • Supports iterator-based incremental reads with thread-safe operations for reading and clearing data.
  • Initializes or overrides the Iceberg table during construction (a sketch follows).
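The following is a minimal, hedged Java sketch of how such a document-style wrapper can create (or overwrite) an Iceberg table and expose an iterator-style read over its records. The class name, method names, warehouse path, and schema here are illustrative assumptions, not Texera's actual IcebergDocument API; the Iceberg calls (HadoopCatalog, IcebergGenerics) are the standard Java API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.data.IcebergGenerics;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.types.Types;

// Illustrative sketch only; names and structure are hypothetical.
public class ResultDocumentSketch {
  private final Table table;

  public ResultDocumentSketch(HadoopCatalog catalog, String storageKey, Schema schema) {
    TableIdentifier id = TableIdentifier.of("result", storageKey);
    // Drop any stale table for this storage key, then create a fresh one.
    if (catalog.tableExists(id)) {
      catalog.dropTable(id, /* purge = */ true);
    }
    this.table = catalog.createTable(id, schema, PartitionSpec.unpartitioned());
  }

  // Iterator-style read over all committed records (callers must close the iterable).
  public CloseableIterable<Record> readAll() {
    table.refresh(); // pick up snapshots committed by writers since the last read
    return IcebergGenerics.read(table).build();
  }

  public static void main(String[] args) {
    HadoopCatalog catalog =
        new HadoopCatalog(new Configuration(), "file:///tmp/texera-iceberg-warehouse");
    Schema schema =
        new Schema(Types.NestedField.required(1, "value", Types.StringType.get()));
    ResultDocumentSketch doc = new ResultDocumentSketch(catalog, "demo_key", schema);
    try (CloseableIterable<Record> records = doc.readAll()) {
      records.forEach(r -> System.out.println(r.getField("value")));
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
```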

IcebergTableWriter

  • Writes data as immutable Parquet files in an append-only manner.
  • Each writer uniquely prefixes its files to avoid conflicts (workerIndex_fileIndex format).
  • Not thread-safe; single-threaded access is recommended (a write-path sketch follows).
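Below is a hedged sketch of the append-only write path using the standard Iceberg Java API (Parquet.writeData plus an append commit). The workerIndex_fileIndex naming comes from the description above; the class name, buffer handling, and paths are illustrative and not the actual IcebergTableWriter code.

```java
import java.io.IOException;
import java.util.List;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Table;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.data.parquet.GenericParquetWriter;
import org.apache.iceberg.io.DataWriter;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.parquet.Parquet;

// Illustrative sketch: flush one buffer of records as a new immutable Parquet file.
public final class AppendOnlyWriterSketch {
  private AppendOnlyWriterSketch() {}

  static void flushBuffer(Table table, List<Record> buffer, int workerIndex, int fileIndex)
      throws IOException {
    // The unique "workerIndex_fileIndex" prefix avoids name clashes between writers.
    String fileName = String.format("%d_%d.parquet", workerIndex, fileIndex);
    OutputFile outputFile =
        table.io().newOutputFile(table.locationProvider().newDataLocation(fileName));

    DataWriter<Record> writer =
        Parquet.writeData(outputFile)
            .schema(table.schema())
            .createWriterFunc(GenericParquetWriter::buildWriter)
            .withSpec(PartitionSpec.unpartitioned())
            .overwrite()
            .build();
    try {
      for (Record record : buffer) {
        writer.write(record);
      }
    } finally {
      writer.close();
    }

    // Committing the append makes the new file visible to readers as one atomic snapshot.
    table.newAppend().appendFile(writer.toDataFile()).commit();
  }
}
```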

Data Storage via Iceberg Tables

  • Write:
    • Tables are created per storage key.
    • Writers append Parquet files to the table, ensuring immutability.
  • Read:
    • Readers use IcebergDocument.get to fetch data via an iterator.
    • The iterator reads data incrementally while ensuring data order matches the commit sequence of the data files.

Data Reading Using File Metadata

  • Data files are read using getUsingFileSequenceOrder, which:
    • Retrieves and sorts metadata files (FileScanTask) by sequence numbers.
    • Reads records sequentially, skipping files or records as needed.
    • Supports range-based reading (from, until) and incremental reads.
  • Sorting ensures data consistency and order preservation (see the sketch below).
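A hedged sketch of what sequence-ordered scanning can look like with the Iceberg Java API: plan the file scan tasks, sort them by sequence number, and read each file's records in turn. The method and class names below are illustrative rather than the actual getUsingFileSequenceOrder code, range/skip handling is omitted, and the sketch assumes an Iceberg version that exposes ContentFile#fileSequenceNumber.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.data.parquet.GenericParquetReaders;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.parquet.Parquet;

// Illustrative sketch: read records in the order their data files were committed.
public final class SequenceOrderedReadSketch {
  private SequenceOrderedReadSketch() {}

  static List<Record> readInCommitOrder(Table table) throws Exception {
    // 1. Plan the scan and sort the resulting file tasks by file sequence number.
    List<FileScanTask> tasks = new ArrayList<>();
    try (CloseableIterable<FileScanTask> planned = table.newScan().planFiles()) {
      planned.forEach(tasks::add);
    }
    tasks.sort(Comparator.comparingLong((FileScanTask t) -> t.file().fileSequenceNumber()));

    // 2. Read each data file sequentially; a real implementation would skip files
    //    and records here to honor (from, until) ranges and incremental reads.
    List<Record> result = new ArrayList<>();
    for (FileScanTask task : tasks) {
      try (CloseableIterable<Record> records =
          Parquet.read(table.io().newInputFile(task.file().path().toString()))
              .project(table.schema())
              .createReaderFunc(
                  fileSchema -> GenericParquetReaders.buildReader(table.schema(), fileSchema))
              .<Record>build()) {
        records.forEach(result::add);
      }
    }
    return result;
  }
}
```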

Hadoop Usage Without HDFS

  • The HadoopCatalog uses an empty Hadoop configuration, defaulting to the local file system (file:/).
  • This enables efficient management of Iceberg tables in local or network file systems without requiring HDFS infrastructure, as illustrated below.
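Concretely, constructing the catalog with a default Configuration (no HDFS settings) makes file:/ paths resolve against the local file system; the warehouse path and namespace below are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.hadoop.HadoopCatalog;

public class LocalCatalogSketch {
  public static void main(String[] args) {
    // An empty Configuration has no fs.defaultFS pointing at HDFS, so the file:/
    // warehouse below is handled entirely by the local file system.
    HadoopCatalog catalog =
        new HadoopCatalog(new Configuration(), "file:///tmp/texera-iceberg-warehouse");
    catalog.createNamespace(Namespace.of("result"));
    System.out.println(catalog.listTables(Namespace.of("result")));
  }
}
```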

@bobbai00 bobbai00 self-assigned this Dec 10, 2024
@bobbai00 bobbai00 force-pushed the jiadong-add-file-result-storage branch 2 times, most recently from 6522779 to a83d779 Compare December 14, 2024 00:14
@bobbai00 bobbai00 force-pushed the jiadong-add-file-result-storage branch 2 times, most recently from 1edb551 to cef347b Compare December 21, 2024 02:56
@bobbai00 bobbai00 changed the title Add PartitionDocument and ItemizedFileDocument Add IcebergDocument as one implementation of VirtualDocument that can be used to store operator results Dec 22, 2024
@bobbai00 bobbai00 changed the title Add IcebergDocument as one implementation of VirtualDocument that can be used to store operator results Add IcebergDocument as one implementation of VirtualDocument Dec 22, 2024
Collaborator

@shengquan-ni shengquan-ni left a comment


LGTM

@bobbai00 bobbai00 changed the title Add IcebergDocument as one implementation of VirtualDocument Add IcebergDocument as one type of the operator result storage Jan 6, 2025
@bobbai00 bobbai00 merged commit 7debf45 into master Jan 8, 2025
8 checks passed
@bobbai00 bobbai00 deleted the jiadong-add-file-result-storage branch January 8, 2025 23:46
Xiao-zhen-Liu added a commit that referenced this pull request Jan 30, 2025
This PR adds a storage layer implementation on the Python side of
Texera's codebase, mirroring the implementation of our Java-based
storage layer.

## Motivation
- The primary motivation for having a storage layer in Python is to let
Python UDF operators' ports write directly to result tables
without needing to send the results back to Java.
- In the future we will also use the Python storage layer for UDF logs
and workflow runtime statistics.

## Storage APIs
- There are 3 abstract classes in Java's storage implementation:
  - `ReadOnlyVirtualDocument` for read-only tables
  - `VirtualDocument` for tables supporting both read and write operations
  - `BufferedItemWriter` as a writer class of `VirtualDocument`
- We mirror the implementation in Python, but keep only the APIs
relevant to table storage (e.g., APIs related to dataset storage are not
kept in Python). A rough sketch of these abstractions follows.
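The sketch below is purely illustrative Java: the interface shapes and method names (get, getCount, writer, putOne, clear) are assumptions about what these abstractions look like, not the actual Texera signatures.

```java
import java.util.Iterator;

// Hypothetical shapes only; the real Texera interfaces may declare different methods.
interface ReadOnlyVirtualDocument<T> {
  Iterator<T> get();   // read all stored items
  long getCount();     // number of items currently stored
}

interface VirtualDocument<T> extends ReadOnlyVirtualDocument<T> {
  BufferedItemWriter<T> writer(String writerId); // obtain a writer for this document
  void clear();                                  // remove the underlying storage
}

interface BufferedItemWriter<T> {
  void open();
  void putOne(T item); // buffered append; flushed in batches
  void close();        // flush remaining items and commit
}
```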

## Iceberg Document
Following #3147, we add a table-storage implementation based on Apache
Iceberg (pyiceberg), including `IcebergDocument`, `IcebergTableWriter`,
`IcebergCatalogInstance`, and related util functions and tests.

### Limitations of / TODOs for the Python implementation
pyiceberg is less mature than its java-based counterpart. As a result
there are a few functionalities not supported in our current Python
storage implementation.

#### Incremental Read
Incremental read is not supported by pyiceberg; it will be supported [in
the future](apache/iceberg-python#533). Until
then we will not include incremental read in our Python codebase (it is
also not currently needed).
#### Concurrent writers
Iceberg uses optimistic concurrency control for concurrent writers. Java
Iceberg natively supports retry with configurable retry parameters,
using exponential backoff (without randomness). However, pyiceberg does
not currently support retry. We implemented an ad-hoc custom retry
mechanism in `IcebergTableWriter`, using exponential random backoff
based on the [tenacity](https://tenacity.readthedocs.io/en/latest/)
library. It is fast (~0.6 s for 10 concurrent writers writing
20K tuples) and faster than Java's iceberg-native retry (~6 seconds
for the same test). We may need to re-evaluate this custom
implementation if pyiceberg supports retry natively in the future.
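For comparison, the Java-side native retry mentioned above is driven by standard Iceberg table properties. The sketch below uses real TableProperties constants; the class name and the property values shown are just examples, not Texera's configuration.

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;

public final class CommitRetrySketch {
  private CommitRetrySketch() {}

  // Iceberg retries commit conflicts from optimistic concurrency control with
  // exponential backoff governed by these table properties (values are examples).
  static void configureCommitRetry(Table table) {
    table.updateProperties()
        .set(TableProperties.COMMIT_NUM_RETRIES, "4")            // commit.retry.num-retries
        .set(TableProperties.COMMIT_MIN_RETRY_WAIT_MS, "100")    // commit.retry.min-wait-ms
        .set(TableProperties.COMMIT_MAX_RETRY_WAIT_MS, "60000")  // commit.retry.max-wait-ms
        .commit();
  }
}
```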

## Iceberg Catalog
pyiceberg only supports a SQL catalog (PostgreSQL, to be specific) and REST
catalogs for production. We use the PostgreSQL-based SQL catalog in this
implementation for the following reasons:
- It supports local storage.
- We tested that it works with both Java and Python Iceberg storage.
- It is easier to set up for developers (compared to REST services).

### PostgreSQL setup
The Python storage layer requires a running PostgreSQL service in the
environment and an empty database for Iceberg to work.
- **A script to set up a new postgres database for Texera's iceberg
storage has been added for CI tests.**
- The database will be used by pyiceberg to manage the catalog.
- The logic to set up the database is added in the GitHub CI config.
- The Java side can continue using the Hadoop-based catalog for now until we add
storage on operator ports for both Java and Python.
- As the Python storage is not currently used by Python workers, no
action is required for developers for now.

### REST catalogs (feel free to skip this section)
I also explored 3 major REST catalog implementations
([lakekeeper](https://lakekeeper.io),
[polaris](https://polaris.apache.org), and
[gravitino](https://gravitino.apache.org)) and here are some
observations:
- REST catalogs are the trend primarily because different query engines
(Spark, Flink, Snowflake, etc.) relying on iceberg need a central place
to keep and manage the catalogs. Under the hood they all still use some
database as their storage layer.
- Most of them support / recommend cloud storage only in production and
do not support local storage.
- They are incubating projects and lack documentation. For example, I
found it very hard to set up authentication (as pyiceberg requires
authentication to work with REST catalogs) using gravitino, and using
them would add a lot more burden for our developers.
- I successfully made polaris work with our implementation after
setting up auth, but it was surprisingly slow.
- As the PostgreSQL catalog is working, we will explore REST
catalogs further in the future if we have migrated to cloud storage and hit
scalability issues.

## Storage configurations

A static class `StorageConfigs` is added to manage storage-related
configurations. We do NOT read the configs from files. Instead, Java
passes the configs to the Python worker, and the config is filled in
when the worker is initialized. The storage config is hardcoded in
CI tests.

## Other items

`VFSURIFactory` and `DocumentFactory` are added to the Python storage layer,
mirroring the Java implementations.

## TODO for Java Storage
- Add SQL catalog as another type of iceberg catalog (a rough Java sketch follows)
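A hedged Java sketch of what that could look like using Iceberg's built-in JdbcCatalog; the class name, connection string, credentials, and warehouse path are placeholders, and the PostgreSQL JDBC driver must be on the classpath.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.jdbc.JdbcCatalog;

public class JdbcCatalogSketch {
  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    // Placeholder connection details; point these at the same database pyiceberg uses.
    props.put(CatalogProperties.URI, "jdbc:postgresql://localhost:5432/texera_iceberg");
    props.put(JdbcCatalog.PROPERTY_PREFIX + "user", "texera");
    props.put(JdbcCatalog.PROPERTY_PREFIX + "password", "password");
    props.put(CatalogProperties.WAREHOUSE_LOCATION, "file:///tmp/texera-iceberg-warehouse");

    JdbcCatalog catalog = new JdbcCatalog();
    catalog.setConf(new Configuration()); // table data stays on the local file system
    catalog.initialize("texera_catalog", props);
  }
}
```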

---------

Co-authored-by: Jiadong Bai <[email protected]>