Backport of docs: clarify reschedule, migrate, and replacement terminology into release/1.8.x #25144

hc-github-team-nomad-core

Backport

This PR is auto-generated from #24929 to be assessed for backporting due to the inclusion of the label backport/1.8.x.

🚨 Warning: automatic cherry-pick of commits failed. If the first commit failed,
you will see a blank no-op commit below. If at least one commit succeeded, you
will see the cherry-picked commits up to, not including, the commit where
the merge conflict occurred.

The person who merged in the original PR is:
@tgross
This person should manually cherry-pick the original PR into a new backport PR,
and close this one when the manual backport PR is merged in.

merge conflict error: POST https://api.github.com/repos/hashicorp/nomad/merges: 409 Merge conflict []

The below text is copied from the body of the original PR.


Our vocabulary around scheduler behaviors outside of the reschedule and migrate blocks leaves room for confusion around whether the reschedule tracker should be propagated between allocations. There are effectively five different behaviors we need to cover:

  • restart: when the tasks of an allocation fail and we try to restart the tasks in place.

  • reschedule: when the restart block runs out of attempts (or the allocation fails before tasks even start), and we need to move the allocation to another node to try again.

  • migrate: when the user has asked to drain a node and we need to move the allocations. These are not failures, so we don't want to propagate the reschedule tracker.

  • replacement: when a node is lost, we don't count that against the reschedule tracker for the allocations on the node (it's not the allocation's "fault", after all). We don't want to run the migrate machinery here either, as we can't contact the down node. To the scheduler, this is effectively the same as if we bumped the group.count.

  • replacement for disconnect.replace = true: this is a replacement, but the replacement is intended to be temporary, so we propagate the reschedule tracker.
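
As a rough illustration of the five behaviors above, the following Go sketch maps each one to whether the reschedule tracker would carry over to the new allocation. The `Behavior` type, its constants, and `propagatesRescheduleTracker` are hypothetical names invented for this example; they are not Nomad's scheduler types or API, and the mapping only restates the list above.

```go
package main

import "fmt"

// Behavior is a hypothetical label for the five scheduler behaviors
// described in this PR. It is not a type from the Nomad codebase.
type Behavior int

const (
	Restart               Behavior = iota // tasks restarted in place
	Reschedule                            // restart attempts exhausted, or failed before tasks started
	Migrate                               // node drain requested by the user
	Replacement                           // node lost
	DisconnectReplacement                 // temporary replacement for disconnect.replace = true
)

// String returns a readable name for the behavior.
func (b Behavior) String() string {
	switch b {
	case Restart:
		return "restart"
	case Reschedule:
		return "reschedule"
	case Migrate:
		return "migrate"
	case Replacement:
		return "replacement"
	case DisconnectReplacement:
		return "replacement (disconnect.replace = true)"
	default:
		return "unknown"
	}
}

// propagatesRescheduleTracker reports whether the new allocation would
// inherit the reschedule tracker, following the descriptions above.
// Restart returns false because tasks restart in place and no new
// allocation is created (an assumption of this sketch).
func propagatesRescheduleTracker(b Behavior) bool {
	switch b {
	case Reschedule, DisconnectReplacement:
		return true
	default:
		// Migrate and Replacement are not failures of the allocation.
		return false
	}
}

func main() {
	all := []Behavior{Restart, Reschedule, Migrate, Replacement, DisconnectReplacement}
	for _, b := range all {
		fmt.Printf("%s: propagate reschedule tracker = %v\n", b, propagatesRescheduleTracker(b))
	}
}
```

The sketch treats the decision as a pure function of the behavior, which is a simplification: the real scheduler derives the behavior from node and allocation state.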

Add a section to the reschedule, migrate, and disconnect blocks explaining when each item applies. Update the use of the word "reschedule" in several places where "replacement" is correct, and vice-versa.

Fixes: #24918



hashicorp-cla-app bot commented Feb 18, 2025

CLA assistant check
All committers have signed the CLA.


Our vocabulary around scheduler behaviors outside of the `reschedule` and
`migrate` blocks leaves room for confusion around whether the reschedule tracker
should be propagated between allocations. There are effectively five different
behaviors we need to cover:

* restart: when the tasks of an allocation fail and we try to restart the tasks
  in place.

* reschedule: when the `restart` block runs out of attempts (or the allocation
  fails before tasks even start), and we need to move
  the allocation to another node to try again.

* migrate: when the user has asked to drain a node and we need to move the
  allocations. These are not failures, so we don't want to propagate the
  reschedule tracker.

* replacement: when a node is lost, we don't count that against the `reschedule`
  tracker for the allocations on the node (it's not the allocation's "fault",
  after all). We don't want to run the `migrate` machinery here either, as we
  can't contact the down node. To the scheduler, this is effectively the same as
  if we bumped the `group.count`.

* replacement for `disconnect.replace = true`: this is a replacement, but the
  replacement is intended to be temporary, so we propagate the reschedule tracker.

Add a section to the `reschedule`, `migrate`, and `disconnect` blocks explaining
when each item applies. Update the use of the word "reschedule" in several
places where "replacement" is correct, and vice-versa.

Fixes: #24918
Co-authored-by: Aimee Ukasick <[email protected]>
@tgross force-pushed the backport/docs-replacement-vs-reschedule/loosely-prompt-jay branch from 020f697 to 9db9e62 on February 18, 2025 14:58
@tgross merged commit 9f6a2f6 into release/1.8.x on Feb 18, 2025
27 checks passed
@tgross deleted the backport/docs-replacement-vs-reschedule/loosely-prompt-jay branch on February 18, 2025 15:25