[Performance Improvement] Support for AQE mode for delayed query pushdown for optimum runtime & improved debugging #535
base: master
Conversation
This commit handles the scenario where Apache Spark is running in AQE mode by making the Snowflake connector delay query pushdown. This allows Spark to generate a more optimal query plan, which improves performance. Furthermore, it logs the pushdown query in the Spark plan, which allows easy debugging from the Spark History Server and UIs.
@jalpan-randeri Do we plan to merge this? It could also fix the following issue.
@sfc-gh-bli Do we plan to merge this PR?
Yes, I plan to merge this. However, I am waiting for a review of this PR. Can you review it?
@transient implicit private var data: Future[PushDownResult] = _
@transient implicit private val service: ExecutorService = Executors.newCachedThreadPool()

override protected def doPrepare(): Unit = {
What is the benefit of building the RDD in doPrepare instead of doExecute?
The doPrepare method allows the Spark planner to perform initial metadata-collection work asynchronously, whereas doExecute is always a blocking call. By leveraging doPrepare, we can move work such as building the SQL and opening the connection to Snowflake onto a background thread while the main planner operates on other nodes in the plan, giving some performance gains.
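The prepare/execute split described above can be sketched roughly as follows. This is a minimal, self-contained sketch, not the connector's actual code: `PushDownResult`, `DelayedPushdownNode`, and the generated SQL are stand-ins for the real types, and the expensive work is simulated.

```scala
import java.util.concurrent.{Callable, ExecutorService, Executors, Future}

// Stand-in for the connector's real pushdown result type (assumption).
case class PushDownResult(sql: String)

// Illustrative node, not Spark's SparkPlan: doPrepare starts the
// expensive work in the background; doExecute blocks only if that
// work has not finished by execution time.
class DelayedPushdownNode(tableName: String) {
  @transient private var data: Future[PushDownResult] = _
  @transient private val service: ExecutorService =
    Executors.newCachedThreadPool()

  // Called by the planner before execution: kick off SQL generation
  // (and, in the real connector, connection setup) asynchronously.
  def doPrepare(): Unit = {
    val task = new Callable[PushDownResult] {
      override def call(): PushDownResult =
        PushDownResult(s"SELECT * FROM $tableName")
    }
    data = service.submit(task)
  }

  // Called at execution time: await the prepared result.
  def doExecute(): PushDownResult = data.get()
}
```

While doPrepare runs in the background, the planner is free to visit other nodes in the plan, which is where the overlap-derived gain comes from.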
Overall, it looks good, but I am not a committer, so you need approval from someone from Snowflake.
@sfc-gh-bli please review and share your thoughts
Under Adaptive Query Execution mode, Spark overlaps the planning and execution phases, which results in Spark running planning multiple times. The current implementation eagerly pushes the query down during the planning stage; this results in redundant query pushdown to Snowflake and ignores filters discovered at runtime.
This commit handles this scenario by delaying pushdown. This gives Spark AQE a chance to generate the most optimal plan and eliminates the pushdown of redundant queries. This improves performance, because new filters identified at runtime by AQE are pushed down as well.
Furthermore, it logs the pushdown query in the Spark plan. This allows easy debugging from the Spark History Server, the UIs, and the logs.
This PR adds a new unit test suite for this behavior.
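For context, the AQE behavior this PR targets is gated by Spark's standard adaptive-execution configuration. A minimal way to enable it when reproducing the scenario (ordinary Spark configuration, not part of this PR; the app name and master are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-pushdown-demo") // placeholder name
  .master("local[*]")           // placeholder; use your cluster master
  // Enable Adaptive Query Execution so planning and execution overlap
  // and runtime statistics can re-optimize the plan.
  .config("spark.sql.adaptive.enabled", "true")
  .getOrCreate()
```

With AQE enabled, the delayed-pushdown path in this PR is what lets runtime-discovered filters reach the query sent to Snowflake.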