Improve Performance for Retrieving Tables Metadata from Iceberg Catalog #23909

Open
wants to merge 1 commit into
base: master

@shohamyamin shohamyamin commented Oct 25, 2024

Improve Performance for Retrieving Tables Metadata from Iceberg Catalog

Overview

This PR addresses a performance issue encountered when querying column metadata from the Iceberg catalog in Trino (Issue #23468). The primary concern is that requests to retrieve table metadata are executed sequentially for each table, which significantly impacts query performance when dealing with a large number of tables.

Description

The issue manifests when executing the following query on the Iceberg catalog:

SELECT * FROM iceberg.information_schema.columns;

In environments with a large number of tables, the query’s response time increases considerably due to the sequential execution of catalog requests.

Changes Made

To improve performance, this PR parallelizes metadata retrieval. Previously, catalog requests were issued one table at a time, so total latency grew with the number of tables. The updated code uses CompletableFuture to issue the per-table requests asynchronously and process tables concurrently, which reduces the overall time needed to retrieve column metadata.
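
The difference between the two approaches can be sketched as follows. This is a minimal illustration and not the code in this PR: `ParallelMetadataSketch`, `loadSequentially`, `loadConcurrently`, and the `loadTable` function are hypothetical names standing in for the per-table catalog request. Note that `CompletableFuture.runAsync` without an explicit executor runs tasks on the common fork-join pool.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: loadTable stands in for one catalog request per table.
// It must return non-null values because ConcurrentHashMap rejects nulls.
class ParallelMetadataSketch {
    // Old behavior: one request after another, so latency adds up per table.
    static <T> Map<String, T> loadSequentially(List<String> tableNames, Function<String, T> loadTable) {
        Map<String, T> result = new ConcurrentHashMap<>();
        for (String name : tableNames) {
            result.put(name, loadTable.apply(name));
        }
        return result;
    }

    // New behavior: submit every request as an asynchronous task, then wait for all of them.
    static <T> Map<String, T> loadConcurrently(List<String> tableNames, Function<String, T> loadTable) {
        Map<String, T> result = new ConcurrentHashMap<>();
        List<CompletableFuture<Void>> futures = tableNames.stream()
                // No executor argument: tasks run on ForkJoinPool.commonPool()
                .map(name -> CompletableFuture.runAsync(() -> result.put(name, loadTable.apply(name))))
                .collect(Collectors.toList());
        // join() blocks until every task has finished (or rethrows the first failure)
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        return result;
    }
}
```

Both methods return the same map; only the elapsed time differs when `loadTable` is dominated by network latency.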

Additional Context and Related Issues

Issue #23468

Test Setup & Observations:

I tested the change and measured queries running more than 4x faster than without it at the largest table count (100,000 tables).

  • Trino versions: 463, and a build of the master branch with the proposed change (referred to here as Trino 464-SNAPSHOT)
  • Nessie Version: 0.94.4 (In memory)
  • REST Catalog based on JDBC (e.g., Tabular REST Catalog)
  • Testing environment: Docker Compose with MinIO for Iceberg storage and Jaeger for tracing.
  • Hardware: Intel Core i9 processor, 32GB RAM

Performance Comparison

As the table shows, the time to retrieve the columns drops by around 70% at 10,000 tables, and by more at 100,000 tables.

| Number of Tables | Trino Version | Catalog Setup | Query Execution Time (s) | Reduction in Execution Time (%) |
|---|---|---|---|---|
| 10 | Trino 463 | REST | 0.691 | |
| 10 | Trino 463 | Nessie | 1.07 | |
| 10 | Trino 464-SNAPSHOT | REST | 0.677 | 2.02% |
| 10 | Trino 464-SNAPSHOT | Nessie | 0.982 | 8.23% |
| 100 | Trino 463 | REST | 1.24 | |
| 100 | Trino 463 | Nessie | 1.89 | |
| 100 | Trino 464-SNAPSHOT | REST | 0.772 | 37.90% |
| 100 | Trino 464-SNAPSHOT | Nessie | 1.29 | 31.59% |
| 1,000 | Trino 463 | REST | 3.06 | |
| 1,000 | Trino 463 | Nessie | 6.55 | |
| 1,000 | Trino 464-SNAPSHOT | REST | 1.17 | 61.59% |
| 1,000 | Trino 464-SNAPSHOT | Nessie | 2.42 | 63.01% |
| 10,000 | Trino 463 | REST | 11.38 | |
| 10,000 | Trino 463 | Nessie | 29.59 | |
| 10,000 | Trino 464-SNAPSHOT | REST | 3.41 | 70.04% |
| 10,000 | Trino 464-SNAPSHOT | Nessie | 7.70 | 73.00% |
| 100,000 | Trino 463 | REST | 105 | |
| 100,000 | Trino 463 | Nessie | 255 | |
| 100,000 | Trino 464-SNAPSHOT | REST | 16.96 | 83.86% |
| 100,000 | Trino 464-SNAPSHOT | Nessie | 34.72 | 86.38% |

  • Each table in this comparison has 10 columns
  • I ran each test several times, and the results were largely consistent.
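
For readers relating the percentages to "Nx faster" figures, the arithmetic can be sketched as below. The class and method names are hypothetical, and the sketch assumes the percentage column is the relative reduction in execution time against Trino 463; the sample values are the REST catalog timings at 10,000 tables.

```java
import java.util.Locale;

// Hypothetical helper relating a time reduction to a speedup factor.
class SpeedupSketch {
    // Relative reduction in execution time, in percent.
    static double reductionPercent(double baselineSeconds, double newSeconds) {
        return (baselineSeconds - newSeconds) / baselineSeconds * 100.0;
    }

    // How many times faster the new run is than the baseline.
    static double speedupFactor(double baselineSeconds, double newSeconds) {
        return baselineSeconds / newSeconds;
    }

    public static void main(String[] args) {
        // REST catalog, 10,000 tables: 11.38 s on Trino 463, 3.41 s with the change
        System.out.printf(Locale.ROOT, "reduction: %.2f%%, speedup: %.2fx%n",
                reductionPercent(11.38, 3.41), speedupFactor(11.38, 3.41));
    }
}
```

A reduction of about 70% therefore corresponds to a speedup of a little over 3x, while the 100,000-table REST run (105 s down to 16.96 s) is a speedup of roughly 6x.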

Release Notes

(X) Release notes are required, with the following suggested text:

  • Improve performance of retrieving table metadata from the Iceberg catalog. (Issue #23468)

@cla-bot cla-bot bot added the cla-signed label Oct 25, 2024
@github-actions github-actions bot added the iceberg Iceberg connector label Oct 25, 2024
@shohamyamin shohamyamin changed the title Improve Performance for Retrieving Column Metadata from Iceberg Catalog Improve Performance for Retrieving Table Metadata from Iceberg Catalog Oct 25, 2024
@shohamyamin shohamyamin changed the title Improve Performance for Retrieving Table Metadata from Iceberg Catalog Improve Performance for Retrieving Tables Metadata from Iceberg Catalog Oct 29, 2024
@alonahmias

Thank you for this PR! This solution makes a huge difference in reducing the time needed to retrieve all available columns in Iceberg. The performance improvement is very noticeable, especially with larger datasets.

    }
}
List<CompletableFuture<Void>> futures = remainingTables.stream()
        .map(tableName -> CompletableFuture.runAsync(() -> {
Member

We generally don't use the common fork join pool in Trino. We should inject a dedicated executor which is bounded at some number and use that.
GET_METADATA_BATCH_SIZE is 1000, I don't think we want that many parallel fetches, something like 8 parallel fetches is probably enough.

Author

@raunaqmorarka can you help me do that (or point me to similar code)? I am not sure how and where to create that dedicated executor, or how to inject it.
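
As a rough illustration of the reviewer's suggestion, the sketch below passes a dedicated, bounded executor to `CompletableFuture.runAsync` instead of relying on the common fork-join pool. It uses only JDK classes; the names `BoundedMetadataFetchSketch`, `newMetadataFetchExecutor`, `loadAll`, and `loadTable` are hypothetical, the limit of 8 comes from the review comment, and how the executor would be created and injected inside Trino is not shown here.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch of bounded parallel fetches; not the Trino implementation.
class BoundedMetadataFetchSketch {
    // Upper bound on concurrent fetches, taken from the review suggestion.
    static final int MAX_PARALLEL_FETCHES = 8;

    // A fixed pool of daemon threads: at most MAX_PARALLEL_FETCHES tasks run at once,
    // the rest wait in the pool's queue.
    static ExecutorService newMetadataFetchExecutor() {
        AtomicInteger counter = new AtomicInteger();
        return Executors.newFixedThreadPool(MAX_PARALLEL_FETCHES, runnable -> {
            Thread thread = new Thread(runnable, "metadata-fetch-" + counter.getAndIncrement());
            thread.setDaemon(true);
            return thread;
        });
    }

    // The executor is a parameter so the caller (or an injection framework) owns its lifecycle.
    // loadTable must return non-null values because ConcurrentHashMap rejects nulls.
    static <T> Map<String, T> loadAll(List<String> tableNames, Function<String, T> loadTable, Executor executor) {
        Map<String, T> result = new ConcurrentHashMap<>();
        List<CompletableFuture<Void>> futures = tableNames.stream()
                .map(name -> CompletableFuture.runAsync(() -> result.put(name, loadTable.apply(name)), executor))
                .collect(Collectors.toList());
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        return result;
    }
}
```

With this shape, a batch of 1,000 table names still produces 1,000 queued tasks, but no more than 8 catalog requests are in flight at any moment.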


3 participants