
Advise does not give a stack_info when OOM is thrown #2203

Open
Gkrumbach07 opened this issue Dec 10, 2021 · 16 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/stack-guidance Categorizes an issue or PR as relevant to SIG Stack Guidance.

Comments

@Gkrumbach07
Member

Describe the bug
When I run an advise report on a large stack, I get an expected OOM error. However, there is no stack_info returned; instead, report is null and I only get an error message.

To Reproduce
https://khemenu.thoth-station.ninja/api/v1/advise/python/adviser-211209191901-3de39ce57cac2f07

Expected behavior
To see a populated stack_info object under report in the document.

Additional context
This may or may not be a bug, but it would be helpful to see how the report errored out in more detail than just looking at the logs.

@Gkrumbach07 Gkrumbach07 added the kind/bug Categorizes issue or PR as related to a bug. label Dec 10, 2021
@goern
Member

goern commented Dec 21, 2021

/assign @fridex
/priority important-soon

@sesheta sesheta added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Dec 21, 2021
@goern
Member

goern commented Feb 22, 2022

@Gkrumbach07 is this something still happening?

@goern goern added the triage/needs-information Indicates an issue needs more information in order to work on it. label Feb 22, 2022
@Gkrumbach07
Member Author

@Gkrumbach07 is this something still happening?

I haven't seen it in a while, but I have been avoiding large runs. It is related to #1525 and #2204.

If the issues above are resolved, then this one can be resolved as well.

@fridex
Contributor

fridex commented Feb 23, 2022

I was thinking about a solution for this. We could add inter-process communication that would guarantee that stack_info is propagated correctly to the parent process once the child process is killed on OOM. I don't think that's the right way to do this type of thing, though (moreover, we can have results computed even if OOM happens). I think the underlying platform (OpenShift) should provide us a way to be notified when a pod is going to be killed. See #1525. We had discussions with Vasek a while back about this implementation; I'm not sure if it continued or if the OpenShift team is planning to add such support. It is not just our use case: any memory-expensive workload run on Kubernetes/OpenShift could benefit from such a feature.

@goern Do you know people we could reach out to about supporting this feature?

@goern
Member

goern commented Feb 28, 2022

Don't we see 137 exit codes on OOM'd pods? Like:

   State:          Running
      Started:      Thu, 10 Oct 2019 11:14:13 +0200
    Last State:     Terminated
      Reason:       OOMKilled

So we could check for this in the Argo workflow?
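
A minimal sketch of that check with the official kubernetes Python client (the namespace, label selector, and container details here are assumptions, not our actual deployment values):

    # Hypothetical sketch: look for adviser containers terminated by the OOM killer.
    from kubernetes import client, config

    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    v1 = client.CoreV1Api()

    pods = v1.list_namespaced_pod("thoth-backend", label_selector="component=adviser")
    for pod in pods.items:
        for status in pod.status.container_statuses or []:
            terminated = status.last_state.terminated
            if terminated and (terminated.reason == "OOMKilled" or terminated.exit_code == 137):
                print(f"{pod.metadata.name}: container {status.name} was OOM killed")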

@fridex
Contributor

fridex commented Feb 28, 2022

The problem here is that users will not get any recommendations if an OOM kill happens. What we could do: if there is a risk of OOM, we could stop the resolution process and show users the results we have computed so far.
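
A rough sketch of that early-stop idea (the threshold, check interval, and helper names are illustrative, not the adviser's actual code):

    import resource

    SOFT_LIMIT_BYTES = int(5.5 * 1024**3)  # stop well below a 6GB pod limit

    def memory_high() -> bool:
        # ru_maxrss is reported in kilobytes on Linux.
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024 >= SOFT_LIMIT_BYTES

    def resolve(candidates):
        """Resolve candidates, but bail out early when memory gets tight."""
        results = []
        for i, candidate in enumerate(candidates):
            if i % 100 == 0 and memory_high():
                stack_info = [{"type": "WARNING",
                               "message": "Resolution stopped early to avoid OOM; partial results returned."}]
                return results, stack_info
            results.append(candidate)  # stand-in for the real resolution step
        return results, []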

@sesheta
Member

sesheta commented May 29, 2022

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@sesheta sesheta added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 29, 2022
@sesheta
Member

sesheta commented Jun 28, 2022

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@sesheta sesheta added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 28, 2022
@mayaCostantini
Contributor

/sig stack-guidance

@sesheta sesheta added the sig/stack-guidance Categorizes an issue or PR as relevant to SIG Stack Guidance. label Jun 28, 2022
@mayaCostantini
Contributor

Given the memory optimizer implementation and the default limit above which the optimizer runs in adviser deployments (5.75GB, below the 6GB currently allocated), this issue should normally be solved, unless a "burst" in memory usage happens between batches of resolver iterations. The batch size is defined by the THOTH_ADVISER_MEM_OPTIMIZER_ITERATION environment variable in deployments, and a burst is unlikely with the current value of 100, which is small.
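
For illustration, the batched check described above boils down to something like this (THOTH_ADVISER_MEM_OPTIMIZER_ITERATION is the real environment variable; the limit value and function names are made up here, not the adviser's actual code):

    import os
    import resource

    BATCH_SIZE = int(os.getenv("THOTH_ADVISER_MEM_OPTIMIZER_ITERATION", "100"))
    MEM_LIMIT_BYTES = int(5.75 * 1024**3)  # optimizer threshold, below the 6GB allocated

    _iteration = 0

    def should_run_optimizer() -> bool:
        """Check memory only every BATCH_SIZE resolver iterations; a burst inside a batch is not seen."""
        global _iteration
        _iteration += 1
        if _iteration % BATCH_SIZE != 0:
            return False
        rss_bytes = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024
        return rss_bytes >= MEM_LIMIT_BYTES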

Should we still keep this issue open anyway and investigate an existing OpenShift feature to protect against OOM errors? Would a liveness probe be too expensive or not suited to this?

cc @goern @harshad16

@mayaCostantini mayaCostantini removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. triage/needs-information Indicates an issue needs more information in order to work on it. labels Jun 28, 2022
@mayaCostantini
Contributor

Following a discussion with @harshad16, a way to improve the limit would be to

  • observe a spike in memory consumption when running an advise on a stack that caused OOM issues in previous integration tests
  • run advises for similar software stacks using dependency monkey and average the values obtained to better approximate the limit we should set

In parallel, we can continue looking for an OpenShift feature to check on pod memory consumption regularly.
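
One hypothetical way to poll that regularly is the Prometheus HTTP API with the cAdvisor container_memory_working_set_bytes metric (the endpoint URL, namespace, and pod label below are placeholders):

    import requests

    PROMETHEUS_URL = "http://prometheus.example.svc:9090"  # placeholder in-cluster endpoint
    QUERY = 'container_memory_working_set_bytes{namespace="thoth-backend", pod=~"adviser-.*"}'

    response = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    response.raise_for_status()
    for result in response.json()["data"]["result"]:
        pod = result["metric"].get("pod", "<unknown>")
        _, value = result["value"]
        print(f"{pod}: {float(value) / 1024**2:.0f} MiB working set")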

@mayaCostantini mayaCostantini self-assigned this Jul 4, 2022
@VannTen
Member

VannTen commented Sep 5, 2022

Might be related: thoth-station/metrics-exporter#725

We could add inter-process communication that would guarantee that stack_info is propagated correctly to the parent process once the child process is killed on OOM.

I'm pretty sure this is impossible. The OOM kill (whether from a cgroup limit or the kernel) sends a SIGKILL, which cannot be handled in any way by the process itself.

However, we might check something like resource.setrlimit and then catch MemoryError?
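
A sketch of that setrlimit idea (the 5.5GB cap is illustrative): capping the address space below the container limit makes large allocations raise MemoryError, which we can catch and log, instead of the OOM killer sending an uncatchable SIGKILL.

    import resource

    ADDRESS_SPACE_LIMIT = int(5.5 * 1024**3)
    resource.setrlimit(resource.RLIMIT_AS, (ADDRESS_SPACE_LIMIT, ADDRESS_SPACE_LIMIT))

    try:
        data = bytearray(10 * 1024**3)  # an allocation that exceeds the cap
    except MemoryError:
        # Unlike SIGKILL, this can be caught, logged, and turned into a stack_info entry.
        print("allocation failed -- report partial results instead of being OOM killed")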

Should we still keep this issue open anyway and investigate on an existing OpenShift feature to prevent against OOM errors? Would a liveness probe be too expensive or not adapted?

A liveness probe won't help; it's more for the "blocked" kind of problem.

In parallel, we can continue looking for an openshift feature to check on the pod memory consumption regularly

I think the Prometheus node-exporter exposes the containers' memory usage? Not sure if it's included in OpenShift?

@goern
Member

goern commented Sep 5, 2022

I think we should turn this into an operational issue: let's observe and alarm, then adjust.

If we can see how many pods get OOM killed, and at which memory limit, we should be able to manually adjust the limit, using dependency monkey to generate a larger set of observations. Having statistics about the frequency of OOM kills in relation to memory consumption should give us a good basis for prioritizing further actions/implementation/improvements.

If we see an OOM-killed advise, can we adjust the memory limit and rerun it?
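
If we do rerun, the bump could be as simple as this (the growth factor and cap are made-up values; 137 is the OOM-kill exit code mentioned above):

    def next_memory_limit(previous_bytes: int, exit_code: int,
                          factor: float = 1.5, cap: int = 16 * 1024**3) -> int:
        """Return the memory limit to use when re-creating an OOM-killed advise workflow."""
        if exit_code != 137:
            return previous_bytes  # not an OOM kill, keep the current limit
        return min(int(previous_bytes * factor), cap)

    # e.g. a 6GB run that was OOM killed would be retried with 9GB
    print(next_memory_limit(6 * 1024**3, 137) / 1024**3)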

@VannTen
Member

VannTen commented Sep 5, 2022 via email

@VannTen
Member

VannTen commented Sep 22, 2022

So we have three separate things:

  • Observe -> create alerts based on the exit code for OOM + investigate memory consumption in that case.
    -> This depends on "Have a metric that introspects why pods failed in the cluster" metrics-exporter#725
    (Although it might be possible to use memory consumption alone to detect this. I believe Grafana has a sort of "restart detector"; it might fit the job with some PromQL wizardry. I'd rather use the exit codes if available ^^)
    (In that case we would manually adjust the memory limits in the manifests.)
  • Have the pod producer consume the metrics so it can re-create the job/workflow with a larger memory limit.
  • Turn the OOM into a MemoryError exception by using setrlimit so we can do something from Python code (or at least have a stack trace in the logs).

From what I understood, @goern, we would do 1 to see if we need 2 and/or 3?

@codificat codificat moved this to 📋 Backlog in Planning Board Sep 26, 2022
@VannTen
Member

VannTen commented Oct 11, 2022 via email
