fsdp::all_gather_copy_in not currently implemented for the XPU device #1328

Comments
Update with
Okay, I was able to get past the original fsdp::all_gather_copy_in error. Instead, I now see the following warning:

/lus/flare/projects/Aurora_deployment/foremans/micromamba/envs/anl_2024_12_release_2/lib/python3.10/site-packages/oneccl_bindings_for_pytorch/__init__.py:25: UserWarning: Warning: Cannot load xpu CCL. CCL doesn't work for XPU device due to libintel-ext-pt-gpu.so: cannot open shared object file: No such file or directory
  warnings.warn(f"Warning: Cannot load xpu CCL. CCL doesn't work for XPU device due to {e}")

before finally crashing with:

[2025-01-28 10:06:31][I][ezpz/dist:831] Using device='xpu' with backend='DDP' + 'ccl' for distributed training.
[2025-01-28 10:06:31][I][ezpz/dist:877] ['x4204c5s3b0n0'][ 0/47]
[2025-01-28 10:06:31][I][ezpz/test_dist:369:__main__] model=
Network(
(layers): Sequential(
(0): Linear(in_features=128, out_features=1024, bias=True)
(1): Linear(in_features=1024, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=256, bias=True)
(3): Linear(in_features=256, out_features=128, bias=True)
(4): Linear(in_features=128, out_features=128, bias=True)
)
)
[rank38]: Traceback (most recent call last):
[rank38]: File "/lus/flare/projects/Aurora_deployment/foremans/micromamba/envs/anl_2024_12_release_2/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank38]: return _run_code(code, main_globals, None,
[rank38]: File "/lus/flare/projects/Aurora_deployment/foremans/micromamba/envs/anl_2024_12_release_2/lib/python3.10/runpy.py", line 86, in _run_code
[rank38]: exec(code, run_globals)
[rank38]: File "/lus/flare/projects/Aurora_deployment/foremans/projects/saforem2/mmm/deps/ezpz/src/ezpz/test_dist.py", line 420, in <module>
[rank38]: trainer = main()
[rank38]: File "/lus/flare/projects/Aurora_deployment/foremans/projects/saforem2/mmm/deps/ezpz/src/ezpz/test_dist.py", line 405, in main
[rank38]: trainer = train(config)
[rank38]: File "/lus/flare/projects/Aurora_deployment/foremans/projects/saforem2/mmm/deps/ezpz/src/ezpz/test_dist.py", line 214, in train
[rank38]: model, optimizer = build_model_and_optimizer(model, backend=config.backend)
[rank38]: File "/lus/flare/projects/Aurora_deployment/foremans/projects/saforem2/mmm/deps/ezpz/src/ezpz/test_dist.py", line 373, in build_model_and_optimizer
[rank38]: model = DDP(model, device_ids=[ezpz.get_local_rank()])
[rank38]: File "/flare/Aurora_deployment/foremans/projects/saforem2/mmm/venvs/anl_2024_12_release_2/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank38]: _verify_param_shape_across_processes(self.process_group, parameters)
[rank38]: File "/flare/Aurora_deployment/foremans/projects/saforem2/mmm/venvs/anl_2024_12_release_2/lib/python3.10/site-packages/torch/distributed/utils.py", line 294, in _verify_param_shape_across_processes
[rank38]: return dist._verify_params_across_processes(process_group, tensors, logger)
[rank38]: RuntimeError: oneccl_bindings_for_pytorch: allgather isn't implementd on backend [xpu].
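The UserWarning above means oneccl_bindings_for_pytorch could not open libintel-ext-pt-gpu.so, a library shipped with intel_extension_for_pytorch (IPEX). A small diagnostic sketch for checking whether that library is actually present on the node; the directory layout inside the IPEX package is an assumption and varies between releases:

import importlib.util, os

spec = importlib.util.find_spec("intel_extension_for_pytorch")
print("intel_extension_for_pytorch installed:", spec is not None)
if spec is not None:
    # Walk the installed package looking for the shared library named in the warning.
    pkg_dir = os.path.dirname(spec.origin)
    hits = [os.path.join(root, f)
            for root, _, files in os.walk(pkg_dir)
            for f in files if f == "libintel-ext-pt-gpu.so"]
    print("libintel-ext-pt-gpu.so found at:", hits or "not found")
# The dynamic loader also needs to be able to see it:
print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH", "<unset>"))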
@saforem2 : I think that
To check that XCCL is available at runtime:
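A minimal sketch of such a runtime check, assuming a recent PyTorch build with XPU support; torch.distributed.is_xccl_available() only exists in newer releases, so the sketch probes for it rather than calling it unconditionally:

import torch
import torch.distributed as dist

print("torch:", torch.__version__)
# torch.xpu is present in newer builds; guard it for older ones.
print("XPU available:", torch.xpu.is_available() if hasattr(torch, "xpu") else False)
if hasattr(dist, "is_xccl_available"):
    print("XCCL available:", dist.is_xccl_available())
else:
    print("this PyTorch build does not expose torch.distributed.is_xccl_available()")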
Oh awesome, thank you for this! I will work on testing with your changes and report back.
@saforem2 We have implemented all collectives in the XCCL backend; please try the latest stock PyTorch and
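For reference, a minimal sketch of trying the native XCCL backend on a recent stock PyTorch with XPU support; the backend name "xccl", the torch.xpu calls, and the launcher-provided RANK/WORLD_SIZE/LOCAL_RANK/MASTER_ADDR/MASTER_PORT environment variables are all assumptions about that newer stack:

import os
import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
local_rank = int(os.environ.get("LOCAL_RANK", 0))

torch.xpu.set_device(local_rank)
dist.init_process_group(backend="xccl", rank=rank, world_size=world_size)

# A collective that previously failed on the CCL path:
x = torch.ones(4, device="xpu")
dist.all_reduce(x)
print(f"rank {rank}: {x}")
dist.destroy_process_group()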
🚀 The feature, motivation and pitch
FSDP All Gather Copy not Implemented on XPU Device

Overview

I'm trying to run the full_finetune_distributed recipe from pytorch/torchtune, and I receive the following NotImplementedError:
Python and PyTorch Info
Full command and output:
Alternatives
No response
Additional context
No response
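For context, a stripped-down sketch of the kind of FSDP-on-XPU setup that exercises the fsdp::all_gather_copy_in operator; the module path and FSDP arguments follow the public torch.distributed.fsdp API, while the "ccl" backend choice and LOCAL_RANK variable are assumptions about the environment described above:

import os
import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  (registers the "ccl" backend)
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

local_rank = int(os.environ.get("LOCAL_RANK", 0))
dist.init_process_group(backend="ccl")
torch.xpu.set_device(local_rank)

model = torch.nn.Linear(128, 128).to("xpu")
model = FSDP(model, device_id=torch.device("xpu", local_rank))

# The first forward pass gathers the sharded parameters, which is where
# fsdp::all_gather_copy_in was reported as not implemented for XPU.
out = model(torch.randn(8, 128, device="xpu"))
out.sum().backward()
dist.destroy_process_group()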