Is your feature request related to a problem?

Problem statement

A query with a LIMIT clause, such as `SELECT * FROM cloudtrail LIMIT 10`, takes 1.5 minutes to execute when querying a CloudTrail dataset containing more than 10,000 files.

Key issues observed:

- Excessive resource consumption: the query plans more than 10K tasks. With the Dynamic Resource Allocation (DRA) feature, EMR-S Spark spins up additional nodes, which adds a delay of 30+ seconds.
- Inefficient execution: despite Spark's limit optimization, which is designed to minimize unnecessary file scans by reading input splits incrementally, the query does not skip unneeded files.
Root cause analysis
We use the following example to explain the problem. The dataset consists of 225 files, which Spark splits and groups into 12 input splits (see the attached screenshot).

When the user submits the query `SELECT * FROM alb_logs LIMIT 10` directly, the expected behavior is that Spark scans only one split (i.e., one file). If that scan yields the 10 rows, Spark returns the result without scanning additional files; otherwise it scans progressively more files, with the growth rate controlled by `spark.sql.limit.scaleUpFactor`. The execution plan for this direct query (see the attached screenshot) contains a single task that fetches 10 rows from one file without requiring a shuffle stage, and the entire job completes in 24 milliseconds.
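For reference, this fast path can be reproduced with a plain `collect()`. The sketch below assumes the `alb_logs` table is already registered and simply makes the scale-up factor explicit (its default value is 4):

```scala
// Baseline: a plain collect() on a LIMIT query lets Spark scan one input split
// first and only widen the scan (by scaleUpFactor) if too few rows come back.
spark.conf.set("spark.sql.limit.scaleUpFactor", "4") // 4 is already the default

val rows = spark.sql("SELECT * FROM alb_logs LIMIT 10").collect()
println(s"fetched ${rows.length} rows")
```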
However, when the query is submitted through FlintREPL, it interacts with Spark using the following code: `spark.sql("SELECT * FROM alb_logs LIMIT 10").toJSON.collect()`

In this case, the execution plan differs (see the attached screenshot): the toJSON operator causes Spark to split the execution into two stages with a shuffle in between, which adds unnecessary overhead.
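The difference can be seen by comparing the physical plans of the two call paths; this is just a quick way to reproduce the observation above:

```scala
// Direct LIMIT query: plans a CollectLimit over the file scan (single stage).
spark.sql("SELECT * FROM alb_logs LIMIT 10").explain()

// Same query through toJSON, as FlintREPL does: the rewritten plan introduces
// the extra stage/shuffle described above.
spark.sql("SELECT * FROM alb_logs LIMIT 10").toJSON.explain()
```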
What solution would you like?
I would like a solution that progressively plans the InputPartitions and collects only as much data as the limit requires; a rough sketch of the idea follows.
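As an illustration only (not Flint's implementation; the `progressiveCollect` helper below is hypothetical): scan a small set of partitions first and widen the scan only when too few rows come back, mirroring what Spark already does for a plain LIMIT + collect(). Going through `df.rdd` loses some SQL optimizations, so this is just a sketch of the shape of the solution:

```scala
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helper: collect up to `limit` rows by scanning partitions in
// progressively larger batches instead of planning all 10K+ tasks up front.
def progressiveCollect(df: DataFrame, limit: Int, scaleUpFactor: Int = 4): Array[Row] = {
  val rdd = df.rdd
  val totalParts = rdd.getNumPartitions
  var collected = Array.empty[Row]
  var scanned = 0
  var batch = 1
  while (collected.length < limit && scanned < totalParts) {
    val parts = scanned until math.min(scanned + batch, totalParts)
    val chunks = df.sparkSession.sparkContext.runJob(
      rdd, (it: Iterator[Row]) => it.take(limit).toArray, parts)
    collected = (collected ++ chunks.flatten).take(limit)
    scanned += parts.length
    batch *= scaleUpFactor // widen the scan, like spark.sql.limit.scaleUpFactor
  }
  collected
}

// e.g. progressiveCollect(spark.table("alb_logs"), 10)
```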
What alternatives have you considered?
n/a
Do you have any additional context?
attached
@penghuo I think this is a bug in Spark. I filed a PR, apache/spark#48407, to fix it. Until it is merged, if this is a very common usage in our Flint, maybe we should remove the toJSON and convert the dataset to JSON inside Flint instead, e.g. along the lines of the sketch below.
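A minimal sketch of that workaround, assuming Spark 3.x and flat (non-nested) result schemas; the serialization here uses json4s, which Spark already bundles, but the actual library choice would be up to Flint:

```scala
import org.apache.spark.sql.Row
import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization

implicit val formats: DefaultFormats.type = DefaultFormats

val df = spark.sql("SELECT * FROM alb_logs LIMIT 10")
val fieldNames = df.schema.fieldNames

// A plain collect() keeps the single-task CollectLimit path (no toJSON, no shuffle).
val rows: Array[Row] = df.collect()

// Naive driver-side serialization; real code would need to handle nested and
// temporal types properly.
val json: Array[String] = rows.map { row =>
  Serialization.write(fieldNames.zip(row.toSeq).toMap)
}
```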