Improve Performance for Retrieving Tables Metadata from Iceberg Catalog #23909

shohamyamin · 2024-10-25T01:42:31Z

Improve Performance for Retrieving Tables Metadata from Iceberg Catalog

Overview

This PR addresses a performance issue encountered when querying column metadata from the Iceberg catalog in Trino (Issue #23468). The primary concern is that requests to retrieve table metadata are executed sequentially for each table, which significantly impacts query performance when dealing with a large number of tables.

Description

The issue manifests when executing the following query on the Iceberg catalog:

SELECT * FROM iceberg.information_schema.columns;

In environments with a large number of tables, the query’s response time increases considerably due to the sequential execution of catalog requests.

Changes Made

To improve performance, this PR parallelizes the metadata retrieval process. The old code issued one catalog request per table, sequentially, so execution time grew with the number of tables. The updated code uses CompletableFuture to issue the requests concurrently, which significantly reduces the latency of retrieving column metadata.
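The difference between the two approaches can be sketched as follows. This is a minimal illustration, not the code from this PR: `loadColumns` is a hypothetical stand-in for a catalog round trip, and the class and method names are assumptions made for the example. Like the first revision of the change, it uses `CompletableFuture.runAsync` without an explicit executor, which means the tasks run on the common fork-join pool.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class ParallelMetadataSketch
{
    // Hypothetical stand-in for one catalog request; not a Trino API.
    static List<String> loadColumns(String tableName)
    {
        return List.of(tableName + ".id", tableName + ".name");
    }

    // Baseline: one request after another.
    static Map<String, List<String>> loadSequentially(List<String> tables, Function<String, List<String>> loader)
    {
        Map<String, List<String>> result = new ConcurrentHashMap<>();
        for (String table : tables) {
            result.put(table, loader.apply(table));
        }
        return result;
    }

    // Concurrent variant: submit every request first, then wait for all of them.
    static Map<String, List<String>> loadConcurrently(List<String> tables, Function<String, List<String>> loader)
    {
        Map<String, List<String>> result = new ConcurrentHashMap<>();
        List<CompletableFuture<Void>> futures = tables.stream()
                // No executor argument, so this runs on the common fork-join pool.
                .map(table -> CompletableFuture.runAsync(() -> result.put(table, loader.apply(table))))
                .collect(Collectors.toList());
        // join() rethrows any loader failure wrapped in a CompletionException.
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        return result;
    }
}
```

Both methods return the same map; only the scheduling of the requests differs, so the gain depends on how much of each request is spent waiting on the catalog.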

Additional Context and Related Issues

Issue #23468

Test Setup & Observations:

I tested the code changes and the query ran over 4x faster than without the proposed change.

Trino Version: 463, and a build of the master branch with the proposed change, referred to here as Trino 464-SNAPSHOT
Nessie Version: 0.94.4 (In memory)
REST Catalog based on JDBC (e.g., Tabular REST Catalog)
Testing environment: Docker Compose with MinIO for Iceberg storage and Jaeger for tracing.
Hardware: Intel Core i9 processor, 32GB RAM

Performance Comparison

As the table below shows, the time to retrieve the columns drops by around 70% at 10,000 tables and by more than 80% at 100,000 tables.

| Number of Tables | Trino Version | Catalog Setup | Query Execution Time (s) | Reduction in Query Time (%) |
|---|---|---|---|---|
| 10 | Trino 463 | REST | 0.691 | baseline |
| 10 | Trino 463 | Nessie | 1.07 | baseline |
| 10 | Trino 464-SNAPSHOT | REST | 0.677 | 2.02% |
| 10 | Trino 464-SNAPSHOT | Nessie | 0.982 | 8.23% |
| 100 | Trino 463 | REST | 1.24 | baseline |
| 100 | Trino 463 | Nessie | 1.89 | baseline |
| 100 | Trino 464-SNAPSHOT | REST | 0.772 | 37.90% |
| 100 | Trino 464-SNAPSHOT | Nessie | 1.29 | 31.59% |
| 1,000 | Trino 463 | REST | 3.06 | baseline |
| 1,000 | Trino 463 | Nessie | 6.55 | baseline |
| 1,000 | Trino 464-SNAPSHOT | REST | 1.17 | 61.59% |
| 1,000 | Trino 464-SNAPSHOT | Nessie | 2.42 | 63.01% |
| 10,000 | Trino 463 | REST | 11.38 | baseline |
| 10,000 | Trino 463 | Nessie | 29.59 | baseline |
| 10,000 | Trino 464-SNAPSHOT | REST | 3.41 | 70.04% |
| 10,000 | Trino 464-SNAPSHOT | Nessie | 7.70 | 73.00% |
| 100,000 | Trino 463 | REST | 105 | baseline |
| 100,000 | Trino 463 | Nessie | 255 | baseline |
| 100,000 | Trino 464-SNAPSHOT | REST | 16.96 | 83.86% |
| 100,000 | Trino 464-SNAPSHOT | Nessie | 34.72 | 86.38% |

Each table in this comparison has 10 columns.
I ran each test several times, and the results were largely consistent.
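The percentages in the comparison and the "x times faster" figure describe the same measurements in two different ways. The sketch below shows one way to relate them, assuming the percentage column is the reduction in query time relative to Trino 463; the class and method names are made up for the example.

```java
final class TimeReduction
{
    // Share of the baseline time that was saved, as a percentage.
    static double reductionPercent(double baselineSeconds, double newSeconds)
    {
        return (baselineSeconds - newSeconds) / baselineSeconds * 100.0;
    }

    // How many times faster the new run is than the baseline.
    static double speedupFactor(double baselineSeconds, double newSeconds)
    {
        return baselineSeconds / newSeconds;
    }
}
```

For the 100,000-table Nessie measurements (255 s and 34.72 s), `reductionPercent` gives roughly 86.4% and `speedupFactor` gives roughly 7.3.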

Release Notes

(X) Release notes are required, with the following suggested text:

Improve performance of retrieving table metadata from the Iceberg catalog. (Issue #23468)

alonahmias · 2024-11-03T14:21:22Z

Thank you for this PR! This solution makes a huge difference in reducing the time needed to retrieve all available columns in Iceberg. The performance improvement is very noticeable, especially with larger datasets.

raunaqmorarka · 2024-11-05T07:53:31Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java

-                        }
-                    }
+                    List<CompletableFuture<Void>> futures = remainingTables.stream()
+                            .map(tableName -> CompletableFuture.runAsync(() -> {


We generally don't use the common fork join pool in Trino. We should inject a dedicated executor that is bounded at some number of threads and use that.
GET_METADATA_BATCH_SIZE is 1000, and I don't think we want that many parallel fetches; something like 8 parallel fetches is probably enough.

shohamyamin · 2024-11-08

@raunaqmorarka can you help me do that (or point me to similar code)? I am not sure how or where to create that dedicated executor, or how to inject it.
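As a rough illustration of the suggestion above, the following sketch bounds the number of concurrent fetches with a dedicated fixed-size thread pool passed to `CompletableFuture.supplyAsync`. It uses only the JDK; the class name `BoundedMetadataLoader` and its constructor parameter are hypothetical, and it does not show how Trino creates or injects executors in the Iceberg connector.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

final class BoundedMetadataLoader
        implements AutoCloseable
{
    // Dedicated pool, so fetches never run on the common fork-join pool.
    private final ExecutorService executor;

    BoundedMetadataLoader(int maxParallelFetches)
    {
        this.executor = Executors.newFixedThreadPool(maxParallelFetches);
    }

    // Loads every table with at most maxParallelFetches requests in flight.
    // Results are returned in the same order as tableNames.
    <T> List<T> loadAll(List<String> tableNames, Function<String, T> loader)
    {
        // Submit all tasks first; excess tasks wait in the executor's queue.
        List<CompletableFuture<T>> futures = tableNames.stream()
                .map(name -> CompletableFuture.supplyAsync(() -> loader.apply(name), executor))
                .collect(Collectors.toList());
        // Then wait for each result in submission order.
        return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }

    @Override
    public void close()
    {
        executor.shutdownNow();
    }
}
```

With `new BoundedMetadataLoader(8)`, a batch of 1,000 table names would still be submitted in full, but only 8 fetches would run at any moment. Collecting the futures into a list before joining matters here: joining inside the same stream pipeline would wait for each fetch before submitting the next one.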

Parallelized iceberg table metadata loading using CompletableFuture

3c4b627

cla-bot bot added the cla-signed label Oct 25, 2024

github-actions bot added the iceberg Iceberg connector label Oct 25, 2024

shohamyamin added the performance label Oct 25, 2024

shohamyamin changed the title ~~Improve Performance for Retrieving Column Metadata from Iceberg Catalog~~ Improve Performance for Retrieving Table Metadata from Iceberg Catalog Oct 25, 2024

shohamyamin requested review from raunaqmorarka and findinpath October 25, 2024 02:38

shohamyamin changed the title ~~Improve Performance for Retrieving Table Metadata from Iceberg Catalog~~ Improve Performance for Retrieving Tables Metadata from Iceberg Catalog Oct 29, 2024

piotrrzysko requested review from findepi, piotrrzysko, lukasz-stec and hashhar November 4, 2024 16:25

raunaqmorarka reviewed Nov 5, 2024

shohamyamin mentioned this pull request Nov 12, 2024

Improve performance of querying system.jdbc.tables for Hive, Iceberg, and Delta #24110

Merged

shohamyamin Nov 8, 2024
