Advise does not give a stack_info when OOM is thrown #2203
Comments
/assign @fridex |
@Gkrumbach07 is this something still happening? |
I haven't seen it in a while, but I have been avoiding large runs. It is related to #1525 and #2204. If the issues above are resolved, then this issue can be resolved. |
I was thinking about a solution for this. We could add inter-process communication that would guarantee that … @goern, do you know people to reach out to who could support this feature? |
Don't we see exit code 137 errors on OOM'd pods? Like …
So we can check for this in the Argo workflow? |
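As background for the comment above: exit code 137 corresponds to SIGKILL (128 + 9), which is what the kernel or cgroup OOM killer sends, and Kubernetes additionally records the termination reason `OOMKilled`. Below is only a hedged sketch, not part of the adviser codebase; the function name and the `thoth` namespace are hypothetical.

```python
from kubernetes import client, config


def was_oom_killed(pod_name: str, namespace: str = "thoth") -> bool:
    """Check whether any container in the given pod terminated with exit code 137 / OOMKilled."""
    config.load_incluster_config()  # or config.load_kube_config() when running outside the cluster
    v1 = client.CoreV1Api()
    pod = v1.read_namespaced_pod(name=pod_name, namespace=namespace)
    for status in pod.status.container_statuses or []:
        terminated = status.state.terminated
        if terminated and (terminated.exit_code == 137 or terminated.reason == "OOMKilled"):
            return True
    return False
```

An Argo exit handler (or any post-run step) could call a check like this and emit a clearer error message, or mark the advise result accordingly.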
The problem here is that users will not get any recommendations if an OOM happens. What we could do: if there is a risk of OOM, we could stop the resolution process and show users the results we have computed so far. |
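A minimal sketch of that idea, assuming a hypothetical `resolver` object with a `done` flag and `iterate_batch()` / `partial_report()` methods (not the actual adviser API), and an assumed 5.5 GiB soft threshold below the 6 GiB pod limit: the loop watches its own resident set size and stops early, returning whatever has been resolved so far.

```python
import resource

# Hypothetical soft limit: stop resolving once we approach the 6 GiB currently allocated.
MEMORY_SOFT_LIMIT_BYTES = int(5.5 * 1024 ** 3)


def resolve_with_memory_guard(resolver):
    """Run resolver batches until done or until memory usage approaches the pod limit."""
    while not resolver.done:
        resolver.iterate_batch()
        # ru_maxrss is the peak RSS, reported in kilobytes on Linux.
        rss_bytes = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024
        if rss_bytes >= MEMORY_SOFT_LIMIT_BYTES:
            # Stop early and surface partial results instead of being OOM killed.
            return resolver.partial_report(note="resolution stopped early: memory limit approached")
    return resolver.partial_report()
```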
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /lifecycle rotten |
/sig stack-guidance |
Given the memory optimizer implementation and the default values used in adviser deployments as the limit above which the optimizer runs (5.75 GB < the 6 GB currently allocated), this issue should normally be solved, unless a "burst" in memory usage happens in between batches of resolver iterations, with the batch size defined by the … Should we still keep this issue open anyway and investigate an existing OpenShift feature to protect against OOM errors? Would a liveness probe be too expensive or not well suited? cc @goern @harshad16 |
Following a discussion with @harshad16, a way to improve the limit would be to …
In parallel, we can continue looking for an OpenShift feature to check on the pod memory consumption regularly. |
Might be related: thoth-station/metrics-exporter#725
I'm pretty sure this is impossible. The OOM killer (either from the cgroup or the kernel) sends a SIGKILL, which cannot be handled in any way by the process itself. However, we might check stuff like …
A liveness probe won't help; it's more for the "blocked" kind of problem.
I think the Prometheus node-exporter exposes the containers' memory usage? Not sure if it's included in OpenShift? |
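One thing the process itself can check, along the lines of the truncated suggestion above, is its own cgroup memory accounting. This is only an illustrative sketch; the exact paths depend on whether the node uses cgroup v1 or v2 and on how OpenShift mounts them, and the function name is hypothetical.

```python
from pathlib import Path
from typing import Optional, Tuple


def cgroup_memory_usage_and_limit() -> Optional[Tuple[int, int]]:
    """Return (usage_bytes, limit_bytes) for this container's cgroup, if readable."""
    # cgroup v2 layout.
    v2_usage = Path("/sys/fs/cgroup/memory.current")
    v2_limit = Path("/sys/fs/cgroup/memory.max")
    if v2_usage.exists() and v2_limit.exists():
        limit_raw = v2_limit.read_text().strip()
        limit = int(limit_raw) if limit_raw != "max" else 0
        return int(v2_usage.read_text()), limit
    # cgroup v1 layout.
    v1_usage = Path("/sys/fs/cgroup/memory/memory.usage_in_bytes")
    v1_limit = Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")
    if v1_usage.exists() and v1_limit.exists():
        return int(v1_usage.read_text()), int(v1_limit.read_text())
    return None
```

Reading this periodically from the resolver loop would give an early warning before the OOM killer fires, without relying on SIGKILL being catchable (it is not).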
I think we should turn this into an operational issue: let's observe and alarm, then adjust. If we can see how many pods get OOM killed, and at which memory limit, we should be able to manually adjust the limit, using Dependency Monkey to generate a larger set of observations. Having statistics about the frequency of OOM kills in relation to memory consumption should give us a good priority for further actions/implementation/improvements. If we see an OOM-killed advise, can we adjust the memory limit and rerun it? |
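To make the "observe" part concrete, a hedged sketch follows; the `thoth-backend` namespace and the `component=adviser` label selector are assumptions, not the actual deployment values. It counts OOM-killed adviser pods so the statistics mentioned above could be collected or exported as a metric.

```python
from kubernetes import client, config


def count_oom_killed_adviser_pods(namespace: str = "thoth-backend",
                                  label_selector: str = "component=adviser") -> int:
    """Count adviser pods whose containers were terminated with reason OOMKilled."""
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace=namespace, label_selector=label_selector)
    oom_killed = 0
    for pod in pods.items:
        for status in pod.status.container_statuses or []:
            terminated = status.state.terminated
            if terminated and terminated.reason == "OOMKilled":
                oom_killed += 1
                break
    return oom_killed
```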
> I think we should turn this into an operational issue: let's observe and alarm, then adjust.

+1

> if we see an OOM-killed advise, can we adjust the memory limit and rerun it?

Sure, if we are the producer (aka talking to the k8s API) of the pod resource or its parent resource (workflow, job, whatever). (In that case we should probably have a `RestartPolicy: Never` to handle failures ourselves.)
|
So we have three separate things: …
From what I understood @goern, we would do 1 to see if we need 2 and/or 3? |
> if we see an OOM-killed advise, can we adjust the memory limit and rerun it?
>
> Sure, if we are the producer (aka talking to the k8s API) of the pod resource or its parent resource (workflow, job, whatever). (In that case we should probably have a `RestartPolicy: Never` to handle failures ourselves.)

To clarify a bit of my earlier statement: technically we would not rerun it, but create another one (pod/parent resource) with adjusted resources. But it's functionally equivalent.
|
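To illustrate the "create another one with adjusted resources" idea from the comment above, here is a hedged sketch. The function name, the `thoth-backend` namespace, the assumption that memory limits are expressed in `Gi`, and cloning a finished pod rather than resubmitting the parent workflow are all simplifications for illustration.

```python
import copy
import uuid

from kubernetes import client, config


def resubmit_with_more_memory(pod_name: str, namespace: str = "thoth-backend",
                              bump_factor: float = 1.5) -> str:
    """If the given adviser pod was OOM killed, create a new pod with a larger memory limit."""
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    pod = v1.read_namespaced_pod(name=pod_name, namespace=namespace)

    statuses = pod.status.container_statuses or []
    if not any(s.state.terminated and s.state.terminated.reason == "OOMKilled" for s in statuses):
        return pod_name  # not OOM killed, nothing to do

    new_pod = copy.deepcopy(pod)
    new_pod.metadata = client.V1ObjectMeta(
        name=f"{pod_name}-retry-{uuid.uuid4().hex[:6]}",
        namespace=namespace,
        labels=pod.metadata.labels,
    )
    new_pod.status = None
    new_pod.spec.node_name = None  # drop the old scheduling decision
    new_pod.spec.restart_policy = "Never"  # handle failures ourselves

    container = new_pod.spec.containers[0]
    # Assumes the limit is expressed in Gi (e.g. "6Gi"); real code would parse units properly.
    old_gi = float(container.resources.limits["memory"].rstrip("Gi"))
    container.resources.limits["memory"] = f"{old_gi * bump_factor:g}Gi"

    v1.create_namespaced_pod(namespace=namespace, body=new_pod)
    return new_pod.metadata.name
```

In practice we would more likely patch or resubmit the parent Argo Workflow rather than clone a finished pod, but the flow is the same: detect `OOMKilled`, adjust `resources.limits`, and create a new resource.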
Describe the bug
When I run an advise report on a large stack, I get an expected OOM error. However, there is no `stack_info` returned; instead, `report` is null and I only get an error message.
To Reproduce
https://khemenu.thoth-station.ninja/api/v1/advise/python/adviser-211209191901-3de39ce57cac2f07
Expected behavior
To see a populated `stack_info` object under `report` in the document.
Additional context
This may or may not be a bug, but it would be helpful to see how the report errored out in more detail besides looking at the logs.