Backport of docs: clarify reschedule, migrate, and replacement terminology into release/1.8.x #25144
+113
−77
Backport
This PR is auto-generated from #24929 to be assessed for backporting due to the inclusion of the label backport/1.8.x.
🚨
The person who merged in the original PR is:
@tgross
This person should manually cherry-pick the original PR into a new backport PR,
and close this one when the manual backport PR is merged in.
The below text is copied from the body of the original PR.
Our vocabulary around scheduler behaviors outside of the `reschedule` and `migrate` blocks leaves room for confusion around whether the reschedule tracker should be propagated between allocations. There are effectively five different behaviors we need to cover:

- restart: when the tasks of an allocation fail and we try to restart the tasks in place.
- reschedule: when the `restart` block runs out of attempts (or the allocation fails before tasks even start), and we need to move the allocation to another node to try again.
- migrate: when the user has asked to drain a node and we need to move the allocations. These are not failures, so we don't want to propagate the reschedule tracker.
- replacement: when a node is lost, we don't count that against the `reschedule` tracker for the allocations on the node (it's not the allocation's "fault", after all). We don't want to run the `migrate` machinery here either, as we can't contact the down node. To the scheduler, this is effectively the same as if we bumped the `group.count`.
- replacement for `disconnect.replace = true`: this is a replacement, but the replacement is intended to be temporary, so we propagate the reschedule tracker.

Add a section to the `reschedule`, `migrate`, and `disconnect` blocks explaining when each item applies. Update the use of the word "reschedule" in several places where "replacement" is correct, and vice-versa.

Fixes: #24918
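For reviewers, the five behaviors above can be summarized as a small lookup table. This is only an illustrative sketch in Python, not code from the scheduler: the names `Behavior`, `Outcome`, and `OUTCOMES` are hypothetical, and the `True` value for reschedule is an assumption inferred from the description (failed attempts are counted against the tracker), since the text only states the propagation rule explicitly for migrate, replacement, and the `disconnect.replace = true` case.

```python
from dataclasses import dataclass
from enum import Enum


class Behavior(Enum):
    """The five scheduler behaviors named in the PR description."""
    RESTART = "restart"
    RESCHEDULE = "reschedule"
    MIGRATE = "migrate"
    REPLACEMENT = "replacement"
    DISCONNECT_REPLACEMENT = "replacement for disconnect.replace = true"


@dataclass(frozen=True)
class Outcome:
    # Whether the behavior places a new allocation (hypothetical field).
    new_allocation: bool
    # Whether the reschedule tracker is carried to the new allocation.
    propagate_tracker: bool


OUTCOMES = {
    # Tasks are restarted in place, so there is no new allocation
    # and nothing to propagate.
    Behavior.RESTART: Outcome(new_allocation=False, propagate_tracker=False),
    # Assumption: a failure-driven move keeps counting attempts.
    Behavior.RESCHEDULE: Outcome(new_allocation=True, propagate_tracker=True),
    # A node drain is not a failure, so the tracker is not propagated.
    Behavior.MIGRATE: Outcome(new_allocation=True, propagate_tracker=False),
    # A lost node is not the allocation's "fault".
    Behavior.REPLACEMENT: Outcome(new_allocation=True, propagate_tracker=False),
    # The replacement is intended to be temporary, so the tracker is kept.
    Behavior.DISCONNECT_REPLACEMENT: Outcome(new_allocation=True, propagate_tracker=True),
}


def propagates_tracker(behavior: Behavior) -> bool:
    """Return True if the reschedule tracker follows the new allocation."""
    return OUTCOMES[behavior].propagate_tracker
```

Keeping the two replacement cases as separate entries mirrors the distinction the description draws: both place a new allocation because a node is unreachable, but only the `disconnect.replace = true` case carries the tracker forward.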
major preview links:
Overview of commits