Replies: 1 comment
-
Iceberg relies on snapshot isolation and optimistic concurrency control at the table level to handle concurrent writes. This does not scale beyond a few concurrent writers, so in a streaming/realtime use case it is important to buffer data and commit every x seconds. This is what the article explains with the asynchronous ingestion pattern: files are written to S3 in parallel and later committed to Iceberg in small batches, so that commits are not issued concurrently.
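A minimal sketch of that pattern in Python, with stand-ins for the real S3 and Iceberg calls (`write_data_file` and the commit step are hypothetical placeholders, not an actual Iceberg API): many writers upload data files in parallel, while a single committer drains the buffer and makes one table commit per interval.

```python
import threading
import queue
import time

committed_batches = []          # each entry represents one Iceberg append commit
pending_files = queue.Queue()   # files already written to S3 but not yet committed

def write_data_file(writer_id, n):
    """Parallel writers: uploading a data file does NOT touch table metadata."""
    path = f"s3://bucket/data/writer{writer_id}-{n}.parquet"
    pending_files.put(path)

def committer(interval_s, stop):
    """Single committer: drains the buffer and makes ONE commit per interval."""
    while not stop.is_set() or not pending_files.empty():
        time.sleep(interval_s)
        batch = []
        while not pending_files.empty():
            batch.append(pending_files.get())
        if batch:
            committed_batches.append(batch)  # stand-in for one Iceberg commit

stop = threading.Event()
t = threading.Thread(target=committer, args=(0.05, stop))
t.start()

writers = [threading.Thread(target=write_data_file, args=(w, n))
           for w in range(4) for n in range(5)]
for w in writers:
    w.start()
for w in writers:
    w.join()
stop.set()
t.join()

total_files = sum(len(b) for b in committed_batches)
print(total_files, len(committed_batches))
```

The point of the design is visible in the counts: 20 files land in far fewer commits, because only the committer thread ever touches table metadata.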
-
In this blog it is mentioned that:

> There's several considerations here. First, we need to ensure we are not making too many concurrent commits to an Iceberg table. In short, each time we append files to an Iceberg table, that creates a commit, and you cannot have a high number of concurrent commits to a single table because Iceberg needs to maintain atomicity by locking the table.

AFAIK Iceberg doesn't lock the table; it uses an MVCC/optimistic-concurrency approach. If I understand correctly, the issue seems to be about creating many snapshots, resulting in extra metadata files and small data files. Or does this refer to specific metastore implementations?
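To make the distinction concrete, here is a toy sketch (assumed names throughout; the `Catalog` class is an in-memory stand-in, not a real metastore) of how an optimistic commit protocol works: each committer prepares a new snapshot off-line, then atomically swaps the catalog's metadata pointer only if the version it read is still current, retrying on conflict. No table-wide lock is held while data is written; only the pointer swap itself is atomic.

```python
import threading

class Catalog:
    """Toy in-memory catalog: holds the table's current metadata version."""
    def __init__(self):
        self._lock = threading.Lock()  # guards only the pointer swap, not the table
        self.version = 0
        self.snapshots = []

    def read(self):
        return self.version

    def compare_and_swap(self, expected, snapshot):
        """Atomic metadata-pointer swap; fails if another commit landed first."""
        with self._lock:
            if self.version != expected:
                return False
            self.snapshots.append(snapshot)
            self.version += 1
            return True

def commit(catalog, files):
    """Optimistic commit loop: read, prepare a snapshot, CAS, retry on conflict."""
    while True:
        base = catalog.read()                       # current metadata version
        snapshot = {"base": base, "files": files}   # prepared without any lock
        if catalog.compare_and_swap(base, snapshot):
            return

catalog = Catalog()
threads = [threading.Thread(target=commit, args=(catalog, [f"file-{i}.parquet"]))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(catalog.version, len(catalog.snapshots))
```

Under contention every loser re-reads and retries, which is why throughput degrades as concurrent committers increase, and why batching commits through a single writer (as in the reply above) helps.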