Argoexec init container hits OOMKilled issue #11223
Comments
Have you got a large input artifact? I used to get OOM on it, but setting my init container memory to 160MB fixed it.
I'm not using a large input artifact.
To clarify a bit further (I am a coworker of @taoc2021): if we do a deploy, it works fine on most of our clusters. What factors can cause increased argoexec memory usage? We noticed a 3 MB increase in init container memory compared to the other clusters; does the memory used scale with the number of endpoints in the namespace?
@juliev0 We are seeing the same thing, specifically in our KubeFlow pipelines, which take two inputs as KubeFlow InputPath objects (basically just dynamically resolved paths). We expect our init container running Argoexec to download these artifacts specified by the input paths (from S3). However, we get an OOM. It appears that Argoexec is loading these artifacts into memory in their entirety, while we expected these artifacts to be streamed (and chunked). Can you clarify whether this is indeed the behavior of Argoexec?
Oh, I think it may be that the shared volume on the Pod (so the main container and init container can share these input files) is of
That's an interesting consideration. Our EmptyDir is specified with
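For reference, the Kubernetes behavior being raised here: with a memory-backed (medium: Memory) emptyDir, every byte written to the volume, including downloaded input artifacts, counts against the container's memory limit, while a disk-backed emptyDir does not. A minimal sketch, with purely illustrative volume names that are not taken from this workflow:

  volumes:
    - name: artifacts-on-disk       # disk-backed: artifact bytes do not count toward the memory limit
      emptyDir:
        sizeLimit: 2Gi
    - name: artifacts-in-memory     # tmpfs: files written here do count toward the memory limit
      emptyDir:
        medium: Memory
        sizeLimit: 512Mi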
How did you set the init container memory?
We just ran into the same issue. I tried to investigate more, but wasn't able to find anything, since the init container only runs for a short time and is hard to debug. Describe gives info that volumes:

So it's not that. But with the size of the artifacts, init starts failing.
I'd expect larger objects to require more memory during unarchiving/decompression, so OOM seems reasonable. You could prove it by disabling artifact compression:

  archive:
    none: {}
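For context, a minimal sketch of what disabling archiving looks like on an output artifact, with placeholder template and artifact names; the consuming pod would then download the object without a decompression step:

  templates:
    - name: produce
      container:
        image: alpine:3.19
        command: [sh, -c, "dd if=/dev/urandom of=/tmp/dataset bs=1M count=100"]
      outputs:
        artifacts:
          - name: dataset
            path: /tmp/dataset
            archive:
              none: {}    # store the file as-is; skip the tar+gzip step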
How did you set the new resource settings for the initContainer?
We are using the podSpecPatch to patch the containers. For templates that launch a "work" container, it looks like this:

  templates:
    - name: main
      podSpecPatch: |
        initContainers:
          - name: init
            resources:
              requests:
                cpu: 100m
                memory: 400Mi
              limits:
                memory: 900Mi
        containers:
          - name: wait
            resources:
              requests:
                cpu: 100m
                memory: 400Mi
              limits:
                memory: 900Mi
      steps:

For a template that creates a resource, it looks like this:

  - name: re-index-a
    inputs:
      parameters:
        - name: district
        - name: process-time
    retryStrategy:
      limit: 3
    synchronization:
      semaphore:
        configMapKeyRef:
          name: controller-synchronization
          key: re-index-a
    podSpecPatch: |
      initContainers:
        - name: init
          resources:
            requests:
              cpu: 100m
              memory: 400Mi
            limits:
              memory: 900Mi
      containers:
        - name: main
          resources:
            requests:
              cpu: 100m
              memory: 400Mi
            limits:
              memory: 900Mi
    resource:
      action: create
      manifest: |
But the issue we are seeing is that the memory requirements for the containers in the pods seem awfully large. Generally, they start at just under 400Mi and sometimes go up from there. I wouldn't think the command should need that much memory.
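As an alternative to patching every template, the executor (init and wait) container defaults can be raised once in the controller configuration; a sketch of the relevant fragment of the workflow-controller ConfigMap, assuming the stock workflow-controller-configmap, with illustrative values:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: workflow-controller-configmap
  data:
    executor: |
      resources:
        requests:
          cpu: 100m
          memory: 64Mi
        limits:
          cpu: 500m
          memory: 512Mi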
Pre-requisites
I can confirm the issue exists when I tested with :latest
What happened/what you expected to happen?
The argoexec init container was OOMKilled.
I think that 64Mi of memory should be enough for the init container.
We are looking for the right limit to set; we have checked the differences between environments, but we cannot find a correlation.
We need the Argo Workflows team to check why the argoexec init container uses more than 64Mi of memory; it seems abnormal.
And what causes the init container to use more memory?
Version
v3.4.8 or latest
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container