[Stateful DL] Add out of order implementation #1423

michael-diggin · 2025-01-20T08:09:06Z

Changes

Adds the changes from "Add option in data loader for out of order data" (pytorch#141833) and "Dataloader distribute tasks to workers when in_order is False" (pytorch#142324) to the Stateful DL.
Adds a warning log noting that state management may not work as expected when in_order=False.
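
For readers unfamiliar with the flag, here is a rough, pure-Python sketch of what returning batches out of order means. It is a toy simulation, not the DataLoader implementation: the function name `simulate_yield_order`, the static round-robin assignment of batches to workers, and the absence of any prefetch limit are all assumptions made for illustration.

```python
from typing import List, Tuple


def simulate_yield_order(
    durations: List[float], num_workers: int, in_order: bool
) -> List[Tuple[float, int]]:
    """Return (yield_time, batch_index) pairs in the order the consumer sees them.

    Toy model (an assumption, not the real scheduler): batch i is assigned to
    worker i % num_workers, and each worker processes its batches sequentially.
    """
    worker_free_at = [0.0] * num_workers
    finish_times: List[float] = []
    for idx, cost in enumerate(durations):
        worker = idx % num_workers
        worker_free_at[worker] += cost
        finish_times.append(worker_free_at[worker])

    if in_order:
        # A batch can only be yielded once every earlier batch has been yielded,
        # so one slow batch delays everything queued behind it.
        result = []
        ready_at = 0.0
        for idx, finished in enumerate(finish_times):
            ready_at = max(ready_at, finished)
            result.append((ready_at, idx))
        return result

    # Out of order: batches are yielded as soon as they finish.
    return sorted((finished, idx) for idx, finished in enumerate(finish_times))


if __name__ == "__main__":
    durations = [5.0, 1.0, 1.0, 1.0]  # batch 0 is slow
    print(simulate_yield_order(durations, num_workers=2, in_order=True))
    print(simulate_yield_order(durations, num_workers=2, in_order=False))
```

In this model the yield order no longer matches the sampler order when `in_order` is false, which is the reason saved state may not line up with what the consumer has actually seen.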

The tests I've added are the same ones from the main PyTorch repo, so the flag behaves identically. A few changes were required to accommodate the slight differences in _StatefulMultiProcessingDataLoaderIter.

I have also tested some simple cases of stopping and resuming with both an indexed and an iterable dataset, and it seems to work correctly (these are the same dataset types used in the new test cases). I haven't added extra tests for that to this PR, just to keep the size manageable, but I would add them in a follow-up PR along with checks for extra edge cases.

facebook-github-bot · 2025-01-20T08:09:12Z

Hi @michael-diggin!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

facebook-github-bot · 2025-01-20T09:06:15Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

andrewkho · 2025-01-22T00:46:44Z

@michael-diggin thanks for making this PR! Can you add a warning on the first call to _StatefulMultiProcessingIter.state_dict() if in_order=False please?

pytorch-bot · 2025-01-22T01:06:26Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/data/1423

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (13 Unrelated Failures)

As of commit 43622b5 with merge base 4ec4548:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

Run Nodes Tests / test (macos-latest, 3.10) (gh) (similar failure)
test/nodes/test_multi_node_weighted_sampler.py::TestMultiNodeWeightedSampler::test_multi_node_weighted_sampler_first_exhausted
Run Nodes Tests / test (macos-latest, 3.11) (gh) (similar failure)
test/nodes/test_multi_node_weighted_sampler.py::TestMultiNodeWeightedSampler::test_multi_node_weighted_sampler_first_exhausted

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

Run Nodes Tests / test (macos-latest, 3.12) (gh) (trunk failure)
test/nodes/test_multi_node_weighted_sampler.py::TestMultiNodeWeightedSampler::test_multi_node_weighted_sampler_first_exhausted
Run Nodes Tests / test (macos-latest, 3.9) (gh) (trunk failure)
test/nodes/test_multi_node_weighted_sampler.py::TestMultiNodeWeightedSampler::test_multi_node_weighted_sampler_first_exhausted
Run Nodes Tests / test (ubuntu-latest, 3.10) (gh) (trunk failure)
test/nodes/test_multi_node_weighted_sampler.py::TestMultiNodeWeightedSampler::test_multi_node_weighted_sampler_first_exhausted
Run Nodes Tests / test (ubuntu-latest, 3.11) (gh) (trunk failure)
test/nodes/test_multi_node_weighted_sampler.py::TestMultiNodeWeightedSampler::test_multi_node_weighted_sampler_first_exhausted
Run Nodes Tests / test (ubuntu-latest, 3.12) (gh) (trunk failure)
test/nodes/test_multi_node_weighted_sampler.py::TestMultiNodeWeightedSampler::test_multi_node_weighted_sampler_first_exhausted
Run Nodes Tests / test (ubuntu-latest, 3.13) (gh) (trunk failure)
test/nodes/test_multi_node_weighted_sampler.py::TestMultiNodeWeightedSampler::test_multi_node_weighted_sampler_first_exhausted
Run Nodes Tests / test (ubuntu-latest, 3.9) (gh) (trunk failure)
test/nodes/test_multi_node_weighted_sampler.py::TestMultiNodeWeightedSampler::test_multi_node_weighted_sampler_first_exhausted
Run Nodes Tests / test (windows-latest, 3.10) (gh) (trunk failure)
test/nodes/test_multi_node_weighted_sampler.py::TestMultiNodeWeightedSampler::test_multi_node_weighted_sampler_first_exhausted
Run Nodes Tests / test (windows-latest, 3.11) (gh) (trunk failure)
test/nodes/test_multi_node_weighted_sampler.py::TestMultiNodeWeightedSampler::test_multi_node_weighted_sampler_first_exhausted
Run Nodes Tests / test (windows-latest, 3.12) (gh) (trunk failure)
test/nodes/test_multi_node_weighted_sampler.py::TestMultiNodeWeightedSampler::test_multi_node_weighted_sampler_first_exhausted
Run Nodes Tests / test (windows-latest, 3.9) (gh) (trunk failure)
test/nodes/test_multi_node_weighted_sampler.py::TestMultiNodeWeightedSampler::test_multi_node_weighted_sampler_first_exhausted

This comment was automatically generated by Dr. CI and updates every 15 minutes.

michael-diggin · 2025-01-22T07:46:38Z

> @michael-diggin thanks for making this PR! Can you add a warning on the first call to _StatefulMultiProcessingIter.state_dict() if in_order=False please?

No problem, done. I've gone with logging on every call, but I'm happy to add some state to track whether it's the first call so that it only logs once.
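
A minimal sketch of the log-once alternative mentioned above, assuming a boolean flag on the iterator. The class `SketchStatefulIter`, its attributes, the logger name, and the warning text are hypothetical stand-ins and are not taken from the torchdata code.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger("stateful_iter_sketch")


class SketchStatefulIter:
    """Hypothetical stand-in for the stateful iterator, for illustration only."""

    def __init__(self, in_order: bool = True) -> None:
        self._in_order = in_order
        self._warned_out_of_order = False  # extra state to track the first call
        self._num_yielded = 0

    def state_dict(self) -> Dict[str, Any]:
        # Warn only on the first call when batches may arrive out of order.
        if not self._in_order and not self._warned_out_of_order:
            logger.warning(
                "in_order=False: state management may not work as expected."
            )
            self._warned_out_of_order = True
        return {"num_yielded": self._num_yielded}
```

The trade-off is one extra attribute in exchange for quieter logs when `state_dict()` is called frequently, for example on every checkpoint step.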

michael-diggin added 2 commits January 19, 2025 17:41

Add in-order flag and implementation

f4b38db

handle snapshotting edge case

43622b5

facebook-github-bot added the CLA Signed label Jan 20, 2025

add warning log in state_dict call

57ba9e5
