Orphaned allocations when stopping a job with failures #8475
Comments
Thank you so much for reaching out! I suspect this is a manifestation of issue #6557. Do you only observe this when the allocs are failing due to host resource utilization (e.g. file limits)? What about a consistently failing job (e.g. a job that always fails)?
You're right, that sounds almost exactly like #6557. I don't think my issue could be reproduced if jobs always fail, since I think it needs some successfully running instances to be orphaned, but I might try writing a toy app that has a 50% chance to fail on start and see if that helps us.
Thank you again for the report! I'll close this ticket as a duplicate then.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.12.0 (8f7fbc8)
Operating system and Environment details
Ubuntu 18.04 on AWS. Nomad clients are m5.large instances. We have six clients, but I don't think that matters due to the default binpacking placement strategy.
Issue
After deploying a job in which some tasks are stuck failing, and then stopping that job, some allocations are left running. The job status is "dead" and the allocations' DesiredStatus is "stop", but these orphaned allocations have not actually been stopped.
Reproduction steps
Job file (if appropriate)
This is basically just the example redis job, except with a much higher count and lower allocation requirements (so more allocations can be packed onto a single client).
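For reference, a minimal sketch of what such a job might look like; the specific count and resource values below are assumptions for illustration, not the ones used in this report:

```hcl
job "example" {
  datacenters = ["dc1"]

  group "cache" {
    # Much higher than the example job's default count of 1 (value assumed).
    count = 100

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }

      # Lowered so many allocations can be binpacked onto a single m5.large
      # client (values assumed).
      resources {
        cpu    = 100
        memory = 32

        network {
          mbits = 1
          port "db" {}
        }
      }
    }
  }
}
```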
Nomad logs (if appropriate)
Will send to support email.