
[train] refactor WorkerGroup state #50181

Merged: 22 commits into ray-project:master on Feb 7, 2025

Conversation

matthewdeng (Contributor)

Overview

This PR refactors Ray Train's WorkerGroup state management to make state transitions (start, shutdown) atomic and to make the code more structured.


Key Changes

  1. Split worker group state into separate components:

    • WorkerGroupContext: Stores configuration used to start a worker group
    • WorkerGroupState: Stores runtime state of an active worker group
    • WorkerGroupPollStatus: Stores polling results from workers
  2. Introduced a builder pattern for worker group state management:

    • Added WorkerGroupStateBuilder to handle incremental state construction during WorkerGroup.start() (see the sketch after this list)
    • Improved error handling during worker group startup
  3. Moved worker status and polling logic to dedicated modules:

    • Created new state.py module for state management classes
    • Created new poll.py module for polling-related classes
  4. Renamed WorkerGroupStatus to WorkerGroupPollStatus to make it more explicit

    • TODO: Do this for WorkerStatus as well → WorkerPollStatus
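
A minimal sketch of how the builder flow described above might look (names follow the PR description, but the bodies are simplified placeholders rather than the actual Ray Train implementation):

```python
# Illustrative sketch only: incremental state construction during WorkerGroup.start().
from time import monotonic as time_monotonic


class WorkerGroupState:
    """Runtime state of an active worker group (simplified)."""

    def __init__(self, start_time, workers):
        self.start_time = start_time
        self.workers = workers


class WorkerGroupStateBuilder:
    """Accumulates state incrementally while the worker group starts up."""

    def __init__(self):
        self.start_time = None
        self.workers = None

    def with_start_time(self, start_time):
        self.start_time = start_time
        return self

    def with_workers(self, workers):
        self.workers = workers
        return self

    def build(self) -> WorkerGroupState:
        # Only hand off a fully formed state object, making the transition atomic.
        assert self.start_time is not None and self.workers is not None
        return WorkerGroupState(self.start_time, self.workers)

    def shutdown(self):
        # Tear down anything partially constructed if startup fails midway.
        self.workers = None


# Roughly what WorkerGroup.start() does with the builder:
builder = WorkerGroupStateBuilder()
builder.with_start_time(time_monotonic()).with_workers(workers=[])
state = builder.build()  # the worker group is now "active"
```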

Benefits

  • Clearer separation of concerns between configuration, runtime state, and polling status
  • Cleaner state management during worker group lifecycle
  • Improved code organization and maintainability
  • Better error handling during worker group startup and shutdown


Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@hongpeng-guo self-assigned this on Feb 3, 2025
self._latest_start_time = time_monotonic()
self._worker_group_state_builder.with_start_time(time_monotonic())
self._worker_group_state = self._worker_group_state_builder.build()
self._worker_group_state_builder = None
Contributor:

Why do we clear this self._worker_group_state_builder here? Do we still need it to clear the state when shutting down the worker group?

Contributor Author (matthewdeng):

That is a great question. It is a bit duplicative right now and I want to clean it up more... I'm thinking that WorkerGroup.shutdown should only ever touch the worker_group_state, and that worker_group_state_builder creation/teardown logic should all be contained within WorkerGroup.create.

@@ -425,69 +348,52 @@ def shutdown(self, patience_s: float = 5.0):
with invoke_context_managers(
[callback.on_worker_group_shutdown for callback in self._callbacks]
):
if self._workers:
if self._worker_group_state_builder:
Contributor (hongpeng-guo), Feb 3, 2025:

Echoing the previous comments: why do we need to check self._worker_group_state_builder here?

Comment on lines 203 to 207
worker_group_context = WorkerGroupContext(
    num_workers=num_workers,
    resources_per_worker=resources_per_worker,
)
self._worker_group_context = worker_group_context
Contributor Author (matthewdeng):

Note: This isn't used right now.

node_id_to_workers = collections.defaultdict(list)
# Launch the training function on each worker.
# This task should start a worker thread and return immediately.
ray_get_safe([worker.actor.run_train_fn.remote(train_fn) for worker in workers])
Contributor Author (matthewdeng):

Do we need to try/catch this?

Contributor:

By the way, ray_get_safe doesn't seem to be needed anymore; the original issue has been resolved (see the linked issue and PR).

I think we can delete ray_get_safe from the Ray Train codebase and just use ray.get() for now.

For the original question: if any worker raises, the error will surface here. Wrapping this in try/except would mostly only catch bugs in ray.get() or ray_get_safe itself, so I think it's fine not to use try/except here.
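
For reference, a minimal sketch of the suggested simplification (assuming ray_get_safe really is no longer needed; the worker/actor attribute names come from the snippet above, everything else is illustrative):

```python
# Illustrative sketch: launch the training function on each worker and wait with
# plain ray.get(). If any worker task raises, ray.get() re-raises the error here.
import ray


def launch_train_fn(workers, train_fn):
    refs = [worker.actor.run_train_fn.remote(train_fn) for worker in workers]
    ray.get(refs)
```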

def __len__(self) -> int:
    return len(self._workers)

#####################################################################################
# Utility Methods
Contributor Author (matthewdeng), Feb 5, 2025:

Not sure whether these methods should live at the existing WorkerGroup level or on WorkerGroupState.

Essentially wondering if we should split between an "inactive" and "active" WorkerGroup, with specific methods for each. That would make it clearer to the caller whether it is handling the active or inactive state, and avoid branching logic in the worker group layer that checks whether it's active everywhere.

Something like:

  • WorkerGroup → InactiveWorkerGroup / WorkerGroupFactory
    • create() → ActiveWorkerGroup
  • WorkerGroupState → ActiveWorkerGroup
    • poll()
    • shutdown()
    • __len__()
    • ...

Contributor:

I am a little confused why WorkerGroup maps to an InactiveWorkerGroup and WorkerGroupState maps to an ActiveWorkerGroup. If I understand correctly, WorkerGroupState contains static information about a worker group. Could you explain more?

Contributor Author (matthewdeng):

The idea is that the Controller would start with the InactiveWorkerGroup (maybe WorkerGroupFactory is a better name). Calling start() would return an ActiveWorkerGroup which contains all the state that is now captured in WorkerGroupState, plus additional methods that can be called on an active WorkerGroup, e.g. poll(), shutdown(), execute(), ....
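
A rough sketch of this proposed split (class names come from the discussion above; the bodies are hypothetical placeholders, not what this PR implements):

```python
# Hypothetical sketch of the factory/active split discussed in this thread.
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class WorkerGroupContext:
    num_workers: int
    resources_per_worker: Dict[str, float]


class WorkerGroupFactory:
    """Inactive side: holds only configuration; nothing is running yet."""

    def __init__(self, context: WorkerGroupContext):
        self._context = context

    def start(self) -> "ActiveWorkerGroup":
        workers: List[Any] = []  # placeholder for actually launching workers
        return ActiveWorkerGroup(workers)


class ActiveWorkerGroup:
    """Active side: owns runtime state and the methods that require live workers."""

    def __init__(self, workers: List[Any]):
        self._workers = workers

    def poll(self):
        ...

    def shutdown(self):
        self._workers = []

    def __len__(self) -> int:
        return len(self._workers)
```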

Contributor:

I think that makes sense. I originally tried doing something like this but couldn't figure out a good way.

That way, we don't need to worry about calling inappropriate methods depending on the active status of the worker group.

Comment on lines +639 to +640
@staticmethod
def _sort_workers_by_node_id_and_gpu_id(
Contributor Author (matthewdeng):

Made these static because they're just generic utilities that don't need/modify any WorkerGroup state


def get_workers(self) -> List[Worker]:
    return self._workers
    # TODO: Access workers through WorkerGroupState instead?
Contributor:

I think either way works; I slightly prefer to get it from WorkerGroupState.

Contributor Author (matthewdeng):

Yeah, that's the same feeling I had, which motivated #50181 (comment): basically have the caller work directly with the "active" WorkerGroup rather than needing to support the inactive case within this method, or implementing logic in the caller to check whether it's active.

Contributor (hongpeng-guo) left a comment:

Nice! Overall LGTM; left a few comments.

Contributor (justinvyu) left a comment:

Here's my initial high level pass.

Comment on lines +106 to +110
def with_sync_actor(
    self, sync_actor: SynchronizationActor
) -> "WorkerGroupStateBuilder":
    self.sync_actor = sync_actor
    return self
Contributor:

I was thinking the sync_actor could be moved to be initialized/shut down by a SyncActorCallback. The sync actor is only used by the checkpoint module at the moment, so we could package it with that module rather than have it be a generic worker group concept.

I can see how a sync actor would be helpful as a util for users and future Ray Train features though.

The question is, what state should be managed by the WorkerGroup vs auxiliary callbacks?
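
A rough illustration of the callback-owned lifecycle being floated here (on_worker_group_shutdown appears elsewhere in this PR's diff; on_worker_group_start and the actor body are hypothetical):

```python
# Hypothetical sketch: the sync actor's lifecycle owned by a callback instead of
# the worker group itself.
import ray


@ray.remote
class SynchronizationActor:
    """Placeholder stand-in for the real Ray Train synchronization actor."""

    def ping(self):
        return "ok"


class SyncActorCallback:
    def __init__(self):
        self._sync_actor = None

    def on_worker_group_start(self, worker_group):
        # Create the synchronization actor when the worker group becomes active.
        self._sync_actor = SynchronizationActor.remote()

    def on_worker_group_shutdown(self, worker_group):
        # Tear the actor down together with the worker group.
        if self._sync_actor is not None:
            ray.kill(self._sync_actor)
            self._sync_actor = None
```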

Contributor Author (matthewdeng):

Good question, I had similar thoughts in the other direction for the Datasets sharding logic... right now I haven't seen any overwhelming evidence for one answer or another.

I do agree that it would be helpful as a generic utility.

Contributor:

I prefer the data callback and checkpointing logic to be separated from the controller and worker group as much as possible. I found the backend_executor pretty hard to work with previously, since everything was being handled together (checkpoint logic, backend setup, ray data logic).

Contributor Author (matthewdeng):

Ah agreed! We should definitely strive to make all of these as modular and pluggable as possible (which they aren't yet). I just wasn't sure if the callbacks are the right interface.

Contributor (hongpeng-guo), Feb 7, 2025:

It seems the sync_actor is only used by the ray.train.report function. If it becomes a callback, I think the report function will need to subscribe to this callback and use its broadcast function inside ray.train.report. I'm not sure we should build an important API like report on top of a pluggable component; I feel a callback should be a self-contained module. We can discuss best practices offline.

Comment on lines 38 to 39
def _shutdown_sync_actor(self, sync_actor: SynchronizationActor):
    ray.kill(sync_actor)
Contributor:

For example, I find this sync actor lifecycle management to be a bit out of place here. Worker group should focus on its workers, not some random utility actor.

Contributor Author (matthewdeng):

That's a good point, I was thinking it was there to allow us to do some sort of WorkerGroup synchronization.

Comment on lines 182 to 185
except Exception as e:
    if not self.has_started():
        worker_group_state_builder.shutdown()
    raise e
Contributor:

Why is this conditional on whether it has started? Is it even possible for has_started to be True here?

Contributor Author (matthewdeng):

My train of thought for this was that once has_started is True, then there is a WorkerGroupState which should be acted on, and the Builder should basically be discarded and not operated on anymore.

Right now in the code it should not return True here. This could also be changed to assert not self.has_started().


List[Union[WorkerGroupCallback, WorkerCallback, TrainContextCallback]]
] = None,
placement_strategy: str = "PACK",
checkpoint: Optional[Checkpoint] = None,
Contributor:

TODO: In a followup PR, remove checkpoint from worker group constructor and move it to CheckpointManager, and populate worker context similar to how the dataset shards are passed to the workers.

python/ray/train/v2/tests/test_controller.py: outdated review comments (resolved)
Comment on lines 21 to +22
def before_init_train_context(
    self, worker_group: "WorkerGroup"
    self, workers: List["Worker"]
Contributor:

Why change this to workers?

Contributor Author (matthewdeng):

This is because the WorkerGroup itself is still in the middle of its own creation stage.

Calling WorkerGroup.execute would fail here because the WorkerGroup is not yet "active" at this step, so I wanted to constrain this callback hook to what is "ready", which is the workers.

Contributor:

Got it, this callback is the only one that runs into that problem.

num_workers=num_workers,
resources_per_worker=resources_per_worker,
placement_strategy=placement_strategy,
checkpoint=latest_checkpoint,
Contributor:

I am a bit concerned about putting checkpoint here. All the other fields are static, i.e., they won't change during a worker group's lifetime, but latest_checkpoint will be updated every time we submit a new checkpoint. Maybe we can migrate this field to another component in a follow-up PR.

Contributor (hongpeng-guo) left a comment:

Nice, added a few comments!

@matthewdeng marked this pull request as ready for review on February 7, 2025, 18:48
Contributor (justinvyu) left a comment:

Nice!

@@ -283,16 +284,22 @@ def _restart_worker_group(
)
placement_strategy = self._scaling_policy.scaling_config.placement_strategy

worker_group_context = WorkerGroupContext(
Contributor:

Do you need to add run attempt ID as part of this context?

Contributor Author (matthewdeng):

I opted not to put it in yet since we don't do anything with it yet. I will add it when it's used (probably in the upcoming state management PR).

Comment on lines +88 to +90
@classmethod
def set_start_failure(cls, start_failure):
    cls._start_failure = start_failure
Contributor:

curious why this was needed?

Contributor Author (matthewdeng):

I needed to do this because the WorkerGroup instance is instantiated internally within the controller and changes over time. In test_worker_group_start_failure I had to inject this failure logic somehow, so I did it by setting this class attribute and monkeypatching it.
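
A rough illustration of this kind of class-level failure injection (the WorkerGroup below is a toy stand-in, not the real class, and the test body is simplified from test_worker_group_start_failure):

```python
# Illustrative sketch of class-level failure injection with pytest's monkeypatch.
import pytest


class WorkerGroup:
    """Toy stand-in; only the failure-injection hook is shown."""

    _start_failure = None

    @classmethod
    def set_start_failure(cls, start_failure):
        cls._start_failure = start_failure

    def start(self):
        # The real start() would launch workers; here we only honor the hook.
        if self._start_failure:
            raise self._start_failure


def test_worker_group_start_failure(monkeypatch):
    # Inject the failure on the class, since the controller constructs
    # WorkerGroup instances internally and they change over time.
    monkeypatch.setattr(WorkerGroup, "_start_failure", RuntimeError("boom"))
    with pytest.raises(RuntimeError, match="boom"):
        WorkerGroup().start()
```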

python/ray/train/v2/tests/test_worker_group.py: outdated review comments (resolved)
@matthewdeng enabled auto-merge (squash) on February 7, 2025, 21:44
@github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) on Feb 7, 2025
@matthewdeng merged commit b4279f3 into ray-project:master on Feb 7, 2025
7 checks passed
justinvyu added a commit that referenced this pull request Feb 10, 2025
#50181 updated the internal
`WorkerGroup`, which impacted the `ScalingPolicy` and `FailurePolicy`
input APIs. Note that these are all internal-facing developer APIs.

This PR restores parity to the information available to the
`ScalingPolicy` in the `make_decision_for_running_worker_group` method.
In particular, this PR exposes the `WorkerGroupState`, which contains
the latest worker group's `start_time`.

---------

Signed-off-by: Justin Yu <[email protected]>