
[Performance Improvement] Support for AQE mode for delayed query pushdown for optimum runtime & improved debugging #535

Open
wants to merge 1 commit into base: master
Conversation

jalpan-randeri

Under adaptive query execution (AQE) mode, Spark overlaps the planning and execution phases, which causes Spark to run planning multiple times. The current implementation eagerly pushes the query down during the planning stage, which results in redundant query pushdowns to Snowflake and ignores filters discovered at runtime.

This commit handles that scenario by delaying the pushdown. This gives Spark AQE a chance to generate the most optimal plan and eliminates pushdown of redundant queries. Performance improves because new filters identified at runtime by AQE are also pushed down.

Furthermore, it logs the pushdown query into the Spark plan. This allows easy debugging from the Spark History Server, the UIs, and the logs.

This PR adds a new unit test suite for it.

This commit handles the scenario where Apache Spark runs under AQE
mode by having the Snowflake connector delay pushdown. This allows Spark
to generate a more optimal query plan, resulting in improved performance.

Furthermore, it logs the pushdown query into the Spark plan. This allows
easy debugging from the Spark History Server and UIs.
@urosstan-db

@jalpan-randeri Do we plan to merge this? It could also fix the following issue:
#567

@urosstan-db

@sfc-gh-bli Do we plan to merge this PR?

@jalpan-randeri
Author

Yes, I plan to merge this. However, I am waiting for a review of this PR. Can you review it?

@transient implicit private var data: Future[PushDownResult] = _
@transient implicit private val service: ExecutorService = Executors.newCachedThreadPool()

override protected def doPrepare(): Unit = {


What is the benefit of building the RDD in doPrepare instead of doExecute?

Author


The doPrepare method allows the Spark planner to perform the initial metadata collection work asynchronously, while doExecute is always a blocking call. By leveraging doPrepare we can move some work, such as building the SQL and creating the connection to Snowflake, into a background thread while the main planner operates on other nodes in the plan, giving some performance gains.
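The prepare/execute split described above can be sketched as follows. This is a minimal standalone illustration, not the connector's actual implementation: `PushDownResult`, `SnowflakePlanSketch`, and the generated SQL are hypothetical stand-ins, and Spark's `SparkPlan` lifecycle is simulated with plain method calls.

```scala
import java.util.concurrent.{Callable, ExecutorService, Executors, Future}

// Hypothetical stand-in for the result of building a pushdown query.
case class PushDownResult(sql: String)

// Sketch of a physical plan node that starts expensive pushdown work in
// doPrepare (asynchronously) and only blocks on it in doExecute, mirroring
// the prepare/execute lifecycle the comment describes.
class SnowflakePlanSketch(tableName: String) {
  @transient private var data: Future[PushDownResult] = _
  @transient private val service: ExecutorService = Executors.newCachedThreadPool()

  // Called before execution: submit the work to a background thread so the
  // planner can continue with other nodes in the plan.
  def doPrepare(): Unit = {
    data = service.submit(new Callable[PushDownResult] {
      override def call(): PushDownResult = {
        // e.g. build the pushdown SQL and open the Snowflake connection here
        PushDownResult(s"SELECT * FROM $tableName")
      }
    })
  }

  // Called when the rows are actually needed: block on the prepared result.
  def doExecute(): PushDownResult = {
    if (data == null) doPrepare() // defensive: prepare if it has not run yet
    val result = data.get()       // blocks until the background work finishes
    service.shutdown()
    result
  }
}
```

The gain comes from the window between doPrepare and doExecute: if the planner spends that time on other plan nodes, the connection setup and SQL generation have already completed by the time doExecute blocks on the future.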

@urosstan-db

> Yes, I plan to merge this. However, I am waiting for a review of this PR. Can you review it?

Overall, it looks good, but I am not a committer, so you need approval from someone from Snowflake.

@jalpan-randeri
Author

@sfc-gh-bli please review and share your thoughts.
