[CORE] Implement stage-level resourceProfile auto-adjust framework to avoid oom #8018
@zjuwangg Thank you for your investigation! It's really something we'd like to do.
@PHILO-HE did some investigation some time ago and noted that some code changes in Vanilla Spark are necessary. Did you note it? If so, we may hack the code in Gluten first, then submit a PR to upstream Spark.
@zhli1142015 @Yohahaha @ulysses-you @jackylee-ch @kecookier @zhztheplayer A big feature!
Thank you for proposing this great idea, and glad to see the POC has gained benefits in your prod env!
For me, the most interesting thing is that DRA (dynamic resource allocation) must be enabled. I guess the reason is to change the executor's memory settings after an OOM occurs; otherwise, new executors/pods will still OOM and die, finally leading to Spark job failure. I found Uber has proposed an idea to solve pure on-heap OOM; it may help to understand more context about the reason for the above DRA requirement.
Thank you for sharing this work. Looking forward to an initial patch to try.
Thanks for the interesting PR! I'm curious how on-heap and off-heap sizes are determined in current production environments; looking forward to seeing it.
@FelixYBW Thanks for the detailed review!
We have also considered the situation where a task is retried, but current Spark seems to have no simple way to change the task's retry resources.
If one stage is totally fallen back, there will be no interface to change this stage's resource profile! That's what design FAQ 2 discusses. Maybe more changes are needed in Vanilla Spark; happy to hear more advice.
@FelixYBW, what we previously considered is how to make the resource profile apply to Spark in the whole-stage fallback situation. We thought it may need some hacking into Spark code to achieve that. I note this design tries to use a wrapper or a separate configuration; not sure about the feasibility. @zjuwangg, in your design, you mentioned that... As @Yohahaha mentioned, the Spark community has a SPIP for DynamicExecutorCoreResizing, which is another attempt to avoid the OOM failure.
When Vanilla Spark jobs migrate to Spark native jobs, we recompute and set a new memory config. If the total original memory is M (executor.memory + executor.offheap), we set 0.7M as off-heap and 0.3M as on-heap for the native job.
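For illustration only, a minimal sketch of how such a re-split could be expressed via Spark configuration (the 20g total and the exact config keys are assumptions for the example, not the actual production values):

```scala
import org.apache.spark.SparkConf

// Hypothetical example: original job used M = 20g in total (on-heap + off-heap).
// After migration, re-split M as 0.7M off-heap (native) and 0.3M on-heap.
val totalGb = 20
val conf = new SparkConf()
  .set("spark.executor.memory", s"${(totalGb * 0.3).toInt}g")     // 6g on-heap
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", s"${(totalGb * 0.7).toInt}g") // 14g off-heap
```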
There should be no problem: the current dynamic resource profile mechanism will search the underlying RDD's resource profile, so once we set the resource profile for WholeStageTransformer, the whole stage gets affected (whether a native node or an original Spark node).
@zjuwangg I think this part is comparatively trickier than the others. Do you think you could start with an individual PR which adds a utility / API to do resource estimation on query plans (if that idea aligns with your approach; I am not sure)? Then in subsequent PRs we can adopt this API in Gluten and the whole-stage transformer for the remaining work. Moreover, does the feature target batch query scenarios (ETL, nightly, etc.) more? I remember changing a Spark resource profile usually causes executors to be rebooted, which will cause larger latency on ad-hoc queries. cc @PHILO-HE
Uber is testing Gluten. Let me ping them to see if they have interest.
Should we hack the Spark task scheduler to schedule tasks with large on-heap/off-heap memory to the right executors?
There was some talk previously with the Pinterest team. We have two ways to do this:
To create a POC, we may start from 2 (with 1/2 of executor.instances for off-heap and 1/2 for on-heap) or 3. What do you think? To estimate the off-heap/on-heap ratio, we can start with a configurable value, like 8:2 for off-heap vs. 0:10 for on-heap. Another thing to consider is that Vanilla Spark also supports off-heap memory, but it still needs large on-heap memory, and I didn't find any guideline on how to configure this. I'm also not sure whether the Spark community is still working on moving all large memory allocations from on-heap to off-heap. If one day all large memory allocations in Vanilla Spark can be made off-heap, we won't have this issue, but then there is the new issue of how Gluten and Spark share off-heap memory, which isn't fully solved today.
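For illustration, a hedged sketch of what two such starting-point profiles could look like with Spark's stage-level scheduling API, assuming a 20g total executor memory budget (the numbers are placeholders, not recommendations):

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder}

// 8:2 off-heap:on-heap for stages running mostly native (Gluten) operators.
val nativeHeavyProfile = new ResourceProfileBuilder()
  .require(new ExecutorResourceRequests()
    .offHeapMemory("16g") // 8/10 of the 20g budget for native execution
    .memory("4g"))        // 2/10 stays on-heap
  .build()

// 0:10 for stages that run purely on the JVM (e.g. fully fallen back).
val onHeapOnlyProfile = new ResourceProfileBuilder()
  .require(new ExecutorResourceRequests()
    .memory("20g"))       // whole budget on-heap
  .build()
```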
@zhztheplayer Got your meaning; I will try it this way.
Yes, changing the Spark resource profile will require the resource manager to allocate more executors, which may cause larger latency. In our production scenario, we focus more on improving ETL stability (avoiding failures).
On the Spark side, this feature needs https://issues.apache.org/jira/browse/SPARK-50421 to get fixed.
You may pick it into Gluten temporarily.
This has been fixed in Spark 3.5.4 and the upcoming 4.0.0.
I just opened the first PR, #8195, which adds a set/get ResourceProfile interface in GlutenPlan. Please help review it when you guys have time.
@zjuwangg, I note you have a Spark PR merged to 3.5.4 & 4.0.0 to make a custom resource profile work. So will this proposed Gluten feature only support 3.5.4 and later versions?
In fact, if we just increase stage heap memory and don't adjust off-heap memory, this feature can also work on Spark 3.1+.
Implement stage-level resourceProfile auto-adjust framework to avoid oom
Background
In our production environment, we suffer a lot from Gluten jobs occasionally throwing heap OOM exceptions.
We have dug into these problems, and there are two major kinds of problems causing our jobs to throw OOM. One of them is a heavy upstream exchange: the upstream exchange contains a huge M * N shuffle status (M is the number of shuffle mappers and N the number of reducers). When such a stage begins its shuffle read, the executor side must keep the whole mapStatuses of the upstream shuffle in memory, and when M * N is large, this very likely causes a heap OOM exception.
The root cause is that, for now, all stages in the same Spark application share the same task heap/off-heap memory config, so when different stages require different off-heap/heap fractions, the problem appears. Since #4392 has proposed a potential solution to this type of problem, we did some verification based on that idea.
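To make the scale concrete, a rough back-of-the-envelope sketch (the numbers and per-entry cost here are illustrative assumptions, not measurements from our environment):

```scala
// Rough, illustrative estimate: with per-block size tracking, each of the M map
// statuses records the sizes of its N reduce blocks, so the memory kept on the
// reducer-side executor grows roughly with M * N.
def estimateMapStatusBytes(
    mapPartitions: Long,
    reducePartitions: Long,
    bytesPerBlockEntry: Double = 1.0): Double =
  mapPartitions * reducePartitions * bytesPerBlockEntry

// e.g. 200k mappers feeding 10k reducers ~= 2e9 entries, i.e. on the order of
// gigabytes of heap just for mapStatuses on every executor reading the shuffle.
val approxBytes = estimateMapStatusBytes(200000L, 10000L)
```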
Design
WholeStageTransformer and ColumnarShuffleExchangeExec
Since all underlying native computation gets triggered from WholeStageTransformer or from ColumnarShuffle, we can add a resourceProfile field in WholeStageTransformer, and when doExecuteColumnar gets called, set the resourceProfile for the RDD before the RDD is returned.
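For illustration, a minimal sketch of this idea (the trait and method names are placeholders, not the actual Gluten code); it relies on Spark's RDD.withResources API from stage-level scheduling:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.resource.ResourceProfile
import org.apache.spark.sql.vectorized.ColumnarBatch

// Sketch: carry an optional ResourceProfile on the transformer and attach it
// to the columnar RDD produced by doExecuteColumnar.
trait StageResourceProfileSupport {
  @transient protected var stageResourceProfile: Option[ResourceProfile] = None

  def setStageResourceProfile(rp: ResourceProfile): Unit =
    stageResourceProfile = Some(rp)

  // To be called on the final columnar RDD right before doExecuteColumnar returns it.
  protected def applyStageResourceProfile(rdd: RDD[ColumnarBatch]): RDD[ColumnarBatch] = {
    // RDD.withResources marks the RDD so the stage it lands in uses this profile.
    stageResourceProfile.foreach(rp => rdd.withResources(rp))
    rdd
  }
}
```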
GlutenDynamicAdjustStageRP in HeuristicApplier
When AQE is enabled, we can check all operators in this stage and collect all child query stages (if any) belonging to this stage.
After we have collected all plan nodes belonging to this stage, we know whether any fallback exists, and we can also calculate the shuffle status complexity to roughly estimate the mapStatus memory occupation. The rule then works in several steps based on this collected information.
We have completed a PoC of this design and really solved these two types of OOM problems; we are refactoring the code and plan to contribute it to the community.
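For illustration, a rough sketch of the node collection and mapStatus-complexity estimation described above (the object and method names are placeholders, not the actual rule):

```scala
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.adaptive.ShuffleQueryStageExec

object DynamicAdjustStageRPSketch {

  // Collect the plan nodes belonging to the current stage, stopping at
  // materialized child query stages (the stage boundaries under AQE).
  def collectStageNodes(plan: SparkPlan): Seq[SparkPlan] = plan match {
    case stage: ShuffleQueryStageExec => Seq(stage)
    case other => other +: other.children.flatMap(collectStageNodes)
  }

  // Rough shuffle-status complexity of the stage: sum over upstream shuffles
  // of (map partitions * reduce partitions), a proxy for mapStatus memory.
  def mapStatusComplexity(nodes: Seq[SparkPlan]): Long =
    nodes.collect {
      case s: ShuffleQueryStageExec =>
        s.shuffle.numMappers.toLong * s.shuffle.numPartitions.toLong
    }.sum
}
```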
Requirements
spark.dynamicAllocation.enabled must be true.
Potential Other Benefit
FAQ
Multiple resource profiles can be merged through Spark's mechanism.
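As far as I know, the relevant Spark mechanism here is the profile-merge behaviour of stage-level scheduling, which has to be enabled explicitly; a minimal sketch:

```scala
import org.apache.spark.SparkConf

// When several ResourceProfiles end up in the same stage, Spark can merge them
// (taking the max of each resource) instead of failing, if this is enabled.
val conf = new SparkConf()
  .set("spark.scheduler.resource.profileMergeConflicts", "true")
```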
Potential solutions: a) wrap the whole fallen-back plan with a WrapperNode that has the interface and ability to set a ResourceProfile; b) set a default resource profile suitable for a whole-stage-fallback stage, so there is no need to set it on the plan for this stage.
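A hypothetical sketch of option a), assuming a plain unary wrapper node is acceptable (the class name and structure are illustrative, not part of the proposal's code):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.resource.ResourceProfile
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}

// A no-op wrapper around a fully fallen-back subtree whose only job is to
// carry a ResourceProfile and attach it to the child's RDD.
case class FallbackStageResourceProfileWrapper(
    child: SparkPlan,
    resourceProfile: ResourceProfile)
  extends UnaryExecNode {

  override def output: Seq[Attribute] = child.output

  override protected def doExecute(): RDD[InternalRow] = {
    val rdd = child.execute()
    rdd.withResources(resourceProfile) // stage-level scheduling hook
    rdd
  }

  override protected def withNewChildInternal(newChild: SparkPlan): SparkPlan =
    copy(child = newChild)
}
```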
We'd love to hear more thoughts and receive more comments about this idea!