
Advise does not give a stack_info when OOM is thrown #2203

Open
Gkrumbach07 opened this issue Dec 10, 2021 · 16 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/stack-guidance Categorizes an issue or PR as relevant to SIG Stack Guidance.

Comments

@Gkrumbach07
Member

Describe the bug
When I run an advise report on a large stack, I get an expected OOM error. However, there is no stack_info returned; instead, report is null and I only get an error message.

To Reproduce
https://khemenu.thoth-station.ninja/api/v1/advise/python/adviser-211209191901-3de39ce57cac2f07

Expected behavior
To see a populated stack_info object under report in the document.

Additional context
This may or may not be a bug, but it would be helpful to see how the report errored out in more detail than just looking at the logs.

@Gkrumbach07 Gkrumbach07 added the kind/bug Categorizes issue or PR as related to a bug. label Dec 10, 2021
@goern
Member

goern commented Dec 21, 2021

/assign @fridex
/priority important-soon

@sesheta sesheta added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Dec 21, 2021
@goern
Member

goern commented Feb 22, 2022

@Gkrumbach07 is this something still happening?

@goern goern added the triage/needs-information Indicates an issue needs more information in order to work on it. label Feb 22, 2022
@Gkrumbach07
Member Author

@Gkrumbach07 is this something still happening?

I haven't seen it in a while, but I have been avoiding large runs. It is related to #1525 and #2204.

If the issues above are resolved, then this one can be resolved as well.

@fridex
Contributor

fridex commented Feb 23, 2022

I was thinking about a solution for this. We could add inter-process communication that would guarantee that stack_info is propagated correctly to the parent process once the child process is killed on OOM. I don't think that's the right way to do this type of thing, though (moreover, we can have results computed even if OOM happens). I think the underlying platform (OpenShift) should provide us a way to be notified when a pod is going to be killed. See #1525. We had discussions with Vasek a while back about this implementation; I'm not sure if it continued or if the OpenShift team is planning to add such support. It is not just our use case: any memory-expensive workload run on Kubernetes/OpenShift could benefit from such a feature.

@goern Do you know people we could reach out to about supporting this feature?

@goern
Member

goern commented Feb 28, 2022

Don't we see 137 exit codes on OOM'd pods? Like:

   State:          Running
      Started:      Thu, 10 Oct 2019 11:14:13 +0200
    Last State:     Terminated
      Reason:       OOMKilled

So we could check for this in the Argo workflow?
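
A minimal sketch of that check with the official kubernetes Python client (the namespace, label selector, and container details here are assumptions, not our actual deployment values):

    # Hypothetical sketch: look for adviser containers terminated by the OOM killer.
    from kubernetes import client, config

    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    v1 = client.CoreV1Api()

    pods = v1.list_namespaced_pod("thoth-backend", label_selector="component=adviser")
    for pod in pods.items:
        for status in pod.status.container_statuses or []:
            terminated = status.last_state.terminated
            if terminated and (terminated.reason == "OOMKilled" or terminated.exit_code == 137):
                print(f"{pod.metadata.name}: container {status.name} was OOM killed")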

@fridex
Contributor

fridex commented Feb 28, 2022

The problem here is that users will not get any recommendations if an OOM kill happens. What we could do: if there is a risk of OOM, we could stop the resolution process and show users the results we have computed so far.
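
A rough sketch of that early-stop idea (the threshold, check interval, and helper names are illustrative, not the adviser's actual code):

    import resource

    SOFT_LIMIT_BYTES = int(5.5 * 1024**3)  # stop well below a 6GB pod limit

    def memory_high() -> bool:
        # ru_maxrss is reported in kilobytes on Linux.
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024 >= SOFT_LIMIT_BYTES

    def resolve(candidates):
        """Resolve candidates, but bail out early when memory gets tight."""
        results = []
        for i, candidate in enumerate(candidates):
            if i % 100 == 0 and memory_high():
                stack_info = [{"type": "WARNING",
                               "message": "Resolution stopped early to avoid OOM; partial results returned."}]
                return results, stack_info
            results.append(candidate)  # stand-in for the real resolution step
        return results, []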

@sesheta
Member

sesheta commented May 29, 2022

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@sesheta sesheta added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 29, 2022
@sesheta
Member

sesheta commented Jun 28, 2022

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@sesheta sesheta added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 28, 2022
@mayaCostantini
Contributor

/sig stack-guidance

@sesheta sesheta added the sig/stack-guidance Categorizes an issue or PR as relevant to SIG Stack Guidance. label Jun 28, 2022
@mayaCostantini
Contributor

Given the memory optimizer implementation and the default limit above which the optimizer runs in adviser deployments (5.75GB, below the 6GB currently allocated), this issue should normally be solved, unless a "burst" in memory usage happens between batches of resolver iterations. The batch size is defined by the THOTH_ADVISER_MEM_OPTIMIZER_ITERATION environment variable in deployments, and a burst is unlikely with the current value of 100, which is small.
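
For illustration, the batched check described above boils down to something like this (THOTH_ADVISER_MEM_OPTIMIZER_ITERATION is the real environment variable; the limit value and function names are made up here, not the adviser's actual code):

    import os
    import resource

    BATCH_SIZE = int(os.getenv("THOTH_ADVISER_MEM_OPTIMIZER_ITERATION", "100"))
    MEM_LIMIT_BYTES = int(5.75 * 1024**3)  # optimizer threshold, below the 6GB allocated

    _iteration = 0

    def should_run_optimizer() -> bool:
        """Check memory only every BATCH_SIZE resolver iterations; a burst inside a batch is not seen."""
        global _iteration
        _iteration += 1
        if _iteration % BATCH_SIZE != 0:
            return False
        rss_bytes = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024
        return rss_bytes >= MEM_LIMIT_BYTES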

Should we still keep this issue open anyway and investigate an existing OpenShift feature to protect against OOM errors? Would a liveness probe be too expensive or not suited to this?

cc @goern @harshad16

@mayaCostantini mayaCostantini removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. triage/needs-information Indicates an issue needs more information in order to work on it. labels Jun 28, 2022
@mayaCostantini
Contributor

Following a discussion with @harshad16, a way to improve the limit would be to

  • observe a spike in memory consumption when running an advise on a stack that caused OOM issues in previous integration tests
  • run advises for similar software stacks using dependency monkey and average the values obtained to better approximate the limit we should set

In parallel, we can continue looking for an OpenShift feature to check on pod memory consumption regularly.
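
One hypothetical way to poll that regularly is the Prometheus HTTP API with the cAdvisor container_memory_working_set_bytes metric (the endpoint URL, namespace, and pod label below are placeholders):

    import requests

    PROMETHEUS_URL = "http://prometheus.example.svc:9090"  # placeholder in-cluster endpoint
    QUERY = 'container_memory_working_set_bytes{namespace="thoth-backend", pod=~"adviser-.*"}'

    response = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    response.raise_for_status()
    for result in response.json()["data"]["result"]:
        pod = result["metric"].get("pod", "<unknown>")
        _, value = result["value"]
        print(f"{pod}: {float(value) / 1024**2:.0f} MiB working set")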

@mayaCostantini mayaCostantini self-assigned this Jul 4, 2022
@VannTen
Member

VannTen commented Sep 5, 2022

Might be related: thoth-station/metrics-exporter#725

We could add inter-process communication that would guarantee that stack_info is propagated correctly to the parent process once the child process is killed on OOM.

I'm pretty sure this is impossible. The OOM kill (whether from a cgroup limit or the kernel) sends a SIGKILL, which cannot be handled in any way by the process itself.

However, we might check something like resource.setrlimit and then catch MemoryError?
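
A sketch of that setrlimit idea (the 5.5GB cap is illustrative): capping the address space below the container limit makes large allocations raise MemoryError, which we can catch and log, instead of the OOM killer sending an uncatchable SIGKILL.

    import resource

    ADDRESS_SPACE_LIMIT = int(5.5 * 1024**3)
    resource.setrlimit(resource.RLIMIT_AS, (ADDRESS_SPACE_LIMIT, ADDRESS_SPACE_LIMIT))

    try:
        data = bytearray(10 * 1024**3)  # an allocation that exceeds the cap
    except MemoryError:
        # Unlike SIGKILL, this can be caught, logged, and turned into a stack_info entry.
        print("allocation failed -- report partial results instead of being OOM killed")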

Should we still keep this issue open anyway and investigate on an existing OpenShift feature to prevent against OOM errors? Would a liveness probe be too expensive or not adapted?

A liveness probe won't help; it's more for the "blocked" kind of problem.

In parallel, we can continue looking for an openshift feature to check on the pod memory consumption regularly

I think the Prometheus node-exporter exposes the containers' memory usage? Not sure if it's included in OpenShift?

@goern
Member

goern commented Sep 5, 2022

I think we should turn this into an operational issue: let's observe and alarm, then adjust.

If we can see how many pods get OOM killed, and at which memory limit, we should be able to manually adjust the limit, using dependency monkey to generate a larger set of observations. Having statistics about the frequency of OOM kills in relation to memory consumption should give us a good basis for prioritizing further actions/implementation/improvements.

If we see an OOM-killed advise, can we adjust the memory limit and rerun it?
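
If we do rerun, the bump could be as simple as this (the growth factor and cap are made-up values; 137 is the OOM-kill exit code mentioned above):

    def next_memory_limit(previous_bytes: int, exit_code: int,
                          factor: float = 1.5, cap: int = 16 * 1024**3) -> int:
        """Return the memory limit to use when re-creating an OOM-killed advise workflow."""
        if exit_code != 137:
            return previous_bytes  # not an OOM kill, keep the current limit
        return min(int(previous_bytes * factor), cap)

    # e.g. a 6GB run that was OOM killed would be retried with 9GB
    print(next_memory_limit(6 * 1024**3, 137) / 1024**3)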

@VannTen
Member

VannTen commented Sep 5, 2022 via email

@VannTen
Member

VannTen commented Sep 22, 2022

So we have three separate things:

  • Observe -> create alerts based on the exit code for OOM + investigate memory consumption in that case.
    -> This depends on "Have a metric that introspects why pods failed in the cluster" metrics-exporter#725
    (Although it might be possible to use memory consumption alone to detect this. I believe Grafana has a sort of "restart detector"; it might fit the job with some PromQL wizardry. I'd rather use the exit codes if available ^^)
    (In that case we would manually adjust the memory limits in the manifests.)
  • Have the pod producer consume the metrics so it can re-create the job/workflow with a larger memory limit.
  • Turn the OOM into a MemoryError exception by using setrlimit so we can do something from Python code (or at least have a stack trace in the logs).

From what I understood, @goern, we would do 1 to see if we need 2 and/or 3?

@codificat codificat moved this to 📋 Backlog in Planning Board Sep 26, 2022
@VannTen
Member

VannTen commented Oct 11, 2022 via email
