Argoexec init container OOMKilled issue #11223

Open
2 of 3 tasks
taoc2021 opened this issue Jun 16, 2023 · 13 comments
Labels
area/artifacts (S3/GCP/OSS/Git/HDFS etc), area/executor, P3 (Low priority), solution/workaround (There's a workaround, might not be great, but exists), type/bug

Comments

@taoc2021

taoc2021 commented Jun 16, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

The argoexec init container was OOMKilled.
I think that 64Mi of memory should be enough for the init container.
We are looking for the right limit to set; we have compared the differences between environments but cannot find a correlation.

We need the Argo Workflows team to check why the argoexec init container uses more than 64Mi of memory; it seems abnormal.
What is causing the init container to use more memory?

Init Containers:
  init:
    Container ID:  containerd://3eb8993d0faaec1cd0862bde4e92e33558ca6e329e2e0c960d101a69ad4cc22b
    Image:         quay.io/argoproj/argoexec:latest
    Image ID:      quay.io/argoproj/argoexec@sha256:5cd920d5e57cdc0881f13a9b39238cea462eee59ea35a39f1142b7994a27a5fd
    Port:          <none>
    Host Port:     <none>
    Command:
      argoexec
      init
      --loglevel
      debug
      --log-format
      text
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 12 Jun 2023 16:19:37 +0800
      Finished:     Mon, 12 Jun 2023 16:19:38 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  64Mi
    Requests:
      cpu:     100m
      memory:  64Mi

Version

v3.4.8 or latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

The Argo Workflows configmap is set as follows:
 kubectl get cm workflow-controller-configmap -o yaml -n argo
apiVersion: v1
data:
  executor: |
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 0.1
        memory: 64Mi
      limits:
        cpu: 0.5
        memory: 64Mi
    args:
    - --loglevel
    - debug
    - --gloglevel
    - "6"

Logs from the workflow controller

time="2023-06-16T06:01:51.629Z" level=info msg="node changed" namespace=XXXXX
new.message="OOMKilled (exit code 137)" new.phase=Failed new.progress=0/1 nodeID=XXXXXX old.message= old.phase=Pending old.progress=0/1 workflow=XXXXXX
time="2023-06-16T06:01:51.631Z" level=info msg="node XXXXXX message: OOMKilled (exit code 137)" namespace=XXX workflow=XXX.." ... level=deb type: 'Warning' reason: 'WorkflowNodeFailed' Failed node XXXXXXX: OOMKilled (exit code 137)"

Logs from your workflow's wait container

k logs podname -n xxx -c init
time="2023-06-09T08:32:45.385Z" level=info msg="Starting Workflow Executor" version=untagged
time="2023-06-09T08:32:45.454Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-06-09T08:32:45.454Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false 。。。,Platform:linux/amd64,}"

There are no other logs to print (because the pod was OOMKilled).
@tooptoop4
Contributor

Have you got a large input artifact? I used to get OOMs on that, but setting my init container to 160MB fixed it.
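(For anyone wondering how to do that: one way is to raise the executor resources in workflow-controller-configmap, sketched below from the snippet pasted earlier in this issue. The 160Mi figure is only illustrative, and these defaults should apply to the argoexec init and wait containers.)

data:
  executor: |
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 0.1
        memory: 160Mi   # illustrative value, not a recommendation
      limits:
        cpu: 0.5
        memory: 160Mi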

@taoc2021
Author

I'm not setting a large input artifact.

@Mathiasdm

To clarify a bit further (I am a coworker of @taoc2021): if we do a deploy, it works fine on most of our clusters.
But on specific clusters we see higher memory usage (resulting in OOMKilled).

What factors can cause increased argoexec memory usage? We noticed a 3 MB memory increase in endpoints during init compared to the other clusters; does the memory used scale with the number of endpoints in the namespace?

@juliev0
Contributor

juliev0 commented Jun 22, 2023

@taoc2021 You can take a look at the code starting from here and see if you can find anything that could relate to your issue.

@juliev0 juliev0 added the problem/more information needed (Not enough information has been provided to diagnose this issue) label on Jun 22, 2023
@mysterious-progression

@juliev0 We are seeing the same thing, specifically in our KubeFlow pipelines which take two inputs as KubeFlow InputPath objects. Basically just dynamically resolved paths.

We expect our init container running Argoexec to download these artifacts specified by the input paths (from S3). However, we get an OOM. It appears that Argoexec is loading these artifacts in their entirety into memory, while we expected the behavior to be that these artifacts are streamed (and chunked).

Can you clarify if this is indeed the behavior of Argoexec?

@juliev0
Contributor

juliev0 commented Sep 15, 2023

Here I see it is downloading it to a file, not loading it into memory. For S3 that's ultimately calling this. (Not sure how that one operates under the hood.)

@juliev0
Contributor

juliev0 commented Sep 15, 2023

Oh, I think it may be that the shared volume on the Pod (so the main container and init container can share these input files) is of emptyDir type, which could be storing files in memory. I'm seeing this in the k8s emptyDir docs:

The emptyDir.medium field controls where emptyDir volumes are stored. By default emptyDir volumes are stored on whatever medium that backs the node such as disk, SSD, or network storage, depending on your environment. If you set the emptyDir.medium field to "Memory", Kubernetes mounts a tmpfs (RAM-backed filesystem) for you instead. While tmpfs is very fast, be aware that unlike disks, tmpfs is cleared on node reboot and any files you write count against your container's memory limit.
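For reference, a minimal sketch contrasting the two variants the docs describe (the volume names here are made up); only the Memory-backed one counts written files against the container's memory limit:

volumes:
  - name: artifacts-on-node-storage   # default: backed by whatever medium backs the node (disk, SSD, ...)
    emptyDir: {}
  - name: artifacts-in-memory         # tmpfs: every file written counts against the container's memory limit
    emptyDir:
      medium: Memory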

@mysterious-progression

That's an interesting consideration. Our emptyDir is specified with
EmptyDir: {}
so I'd expect it to be saved to disk, unless there is some way it could still be using RAM. I will see if I can verify.

@agilgur5 agilgur5 added the area/executor, area/artifacts (S3/GCP/OSS/Git/HDFS etc), and P3 (Low priority) labels and removed the problem/more information needed label on Oct 5, 2023
@ravi-dd

ravi-dd commented Jan 30, 2024

Have you got a large input artifact? I used to get OOMs on that, but setting my init container to 160MB fixed it.

How did you set the init container memory?

@llukaspl

llukaspl commented Mar 7, 2024

That's an interesting consideration. Our emptyDir is specified with EmptyDir: {}, so I'd expect it to be saved to disk, unless there is some way it could still be using RAM. I will see if I can verify.

We just ran into the same issue. I tried to investigate more, but wasn't able to find anything, since init only runs for a short time and it's hard to debug. kubectl describe gives this info for the volume:

  input-artifacts:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>

So it's not Medium: Memory.

But as the artifacts get larger, init starts failing.

@Joibel
Member

Joibel commented Mar 7, 2024

I'd expect larger objects to require more memory during unarchiving/decompression, so OOM seems reasonable. You could prove it by disabling artifact compression (compressionLevel: 0) or removing "archiving" (the act of putting all the artifacts into a tar archive):

archive:
  none: {}
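For context, a rough sketch of where those settings sit on an artifact (shown here on an output artifact; the name and path are made up):

outputs:
  artifacts:
    - name: result
      path: /tmp/result
      archive:
        none: {}              # skip archiving/compression entirely
      # or keep the tar but turn off compression:
      # archive:
      #   tar:
      #     compressionLevel: 0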

@Yarin-Shitrit

Have you got a large input artifact? I used to get OOMs on that, but setting my init container to 160MB fixed it.

How did you set the new resource settings for the initContainer?

@RobCannon

RobCannon commented Aug 7, 2024

We are using podSpecPatch to patch the containers. For templates that launch a "work" container, it looks like this:

  templates:
    - name: main
      podSpecPatch: |
        initContainers:
          - name: init
            resources:
              requests:
                cpu: 100m
                memory: 400Mi
              limits:
                memory: 900Mi
        containers:
          - name: wait
            resources:
              requests:
                cpu: 100m
                memory: 400Mi
              limits:
                memory: 900Mi
      steps:

For a template that creates a resource, it looks like this:

    - name: re-index-a
      inputs:
        parameters:
          - name: district
          - name: process-time
      retryStrategy:
        limit: 3
      synchronization:
        semaphore:
          configMapKeyRef:
            name: controller-synchronization
            key: re-index-a
      podSpecPatch: |
        initContainers:
          - name: init
            resources:
              requests:
                cpu: 100m
                memory: 400Mi
              limits:
                memory: 900Mi
        containers:
          - name: main
            resources:
              requests:
                cpu: 100m
                memory: 400Mi
              limits:
                memory: 900Mi
      resource:
        action: create
        manifest: |

But the issue we are seeing is that the memory requirements for these containers seem awfully large. Generally, they start at just under 400Mi and sometimes go up from there. I wouldn't think the command should need that much memory.

@agilgur5 agilgur5 added the solution/workaround (There's a workaround, might not be great, but exists) label on Aug 9, 2024