
Tecorigin sdaa accelerator #6903

Open
siqi654321 wants to merge 7 commits into master from Tecorigin-SDAA-accelerator

Conversation

siqi654321

Description
This PR adds Tecorigin SDAA accelerator support. With this PR, DeepSpeed supports SDAA as a backend for training tasks.
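
For context, a DeepSpeed accelerator backend implements the DeepSpeedAccelerator abstraction in deepspeed/accelerator/abstract_accelerator.py. Below is a minimal illustrative sketch of that shape only, not this PR's actual code: the real interface requires many more methods, and the SDAA-specific names used here (device string, communication backend) are assumptions.

```python
# Illustrative sketch only -- NOT this PR's actual code.
# A real backend subclasses DeepSpeedAccelerator from
# deepspeed.accelerator.abstract_accelerator and implements its full
# method set; only a few representative methods are shown here.

class SDAA_Accelerator:
    def __init__(self):
        self._name = 'sdaa'
        # Assumption: the vendor ships a torch extension and a collective
        # communication library; 'tccl' is a placeholder name.
        self._communication_backend_name = 'tccl'

    def device_name(self, device_index=None):
        # Device strings follow the usual torch convention, e.g. 'sdaa:0'.
        if device_index is None:
            return 'sdaa'
        return 'sdaa:{}'.format(device_index)

    def communication_backend_name(self):
        return self._communication_backend_name

    def is_available(self):
        # A real implementation would query the vendor torch extension
        # (e.g. a hypothetical torch_sdaa); hard-coded for this sketch.
        return False
```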

@tjruwase
Contributor

@siqi654321, is this ready for review?

@siqi654321
Author

@tjruwase Yes, it's ready for review. Tecorigin SDAA is an AI processor that supports AI frameworks such as PyTorch. It's possible to run Transformers/Accelerate/DeepSpeed on SDAA to train foundation models. Website: http://www.tecorigin.com/
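
As background on how user code stays device-agnostic: DeepSpeed scripts go through get_accelerator() rather than hard-coding torch.cuda, so the same code can target CUDA, XPU, or, with this PR, SDAA. A minimal sketch, assuming a DeepSpeed install where the backend is detected:

```python
import torch
from deepspeed.accelerator import get_accelerator

acc = get_accelerator()             # resolves to the detected backend
device = acc.device_name()          # e.g. 'cuda', 'xpu', or 'sdaa'
x = torch.ones(2, 2).to(device)     # tensors move via the abstract device name
print(acc.device_count(), acc.communication_backend_name())
```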

@tjruwase
Contributor

tjruwase commented Feb 6, 2025

@siqi654321, please address the DCO requirement

siqi654321 force-pushed the Tecorigin-SDAA-accelerator branch from 82ca0ea to 38e4d50 on February 7, 2025 02:55
@siqi654321
Author

@tjruwase The DCO issue has been resolved.

@tjruwase
Contributor

tjruwase commented Feb 7, 2025

@siqi654321, the following can help fix the Formatting issue
https://github.com/deepspeedai/DeepSpeed/blob/master/CONTRIBUTING.md#prerequisites

Signed-off-by: siqi <[email protected]>
@siqi654321
Author

@tjruwase Thanks for the help. It seems that the formatting error has been fixed. I would also like to ask whether the additional documentation you mentioned is necessary; I see that some accelerators, such as mlu, do not provide these documents.

@tjruwase
Contributor

tjruwase commented Feb 8, 2025

@tjruwase Thanks for the help. It seems that the formatting error has been fixed. I would also like to ask whether the additional documentation you mentioned is necessary.

@siqi654321, the additional documentation is completely optional. This PR will merge once CI completes.

@siqi654321
Author

@tjruwase I found that the xpu-max1100 test has failed, but it doesn't seem to be related to my changes. Could you help take a look at this issue?

@tjruwase
Contributor

tjruwase commented Feb 11, 2025

@delock, this CI failure occurs sporadically but seems unrelated to the PRs. Can you help investigate? Thanks!

https://github.com/deepspeedai/DeepSpeed/actions/runs/13260306651/job/37017257486?pr=6903

@loadams
Collaborator

loadams commented Feb 12, 2025

@siqi654321, please consider doing the following as appropriate:

  1. Reviewing https://www.deepspeed.ai/tutorials/accelerator-abstraction-interface/
  2. Updating https://www.deepspeed.ai/tutorials/accelerator-setup-guide/
  3. Updating https://github.com/deepspeedai/DeepSpeed?tab=readme-ov-file#contributed-hw-support
  4. Updating https://github.com/deepspeedai/DeepSpeed?tab=readme-ov-file#build-pipeline-status

@loadams, did I miss anything?

I think this covers everything, the README and the accelerator setup guide being the most important.
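
Regarding item 1 in the checklist above: the accelerator abstraction also honors the DS_ACCELERATOR environment variable to force a specific backend, which is a quick way to verify that a newly added accelerator is registered. A sketch, assuming this PR registers the name 'sdaa' (the variable must be set before deepspeed is imported):

```python
import os
os.environ['DS_ACCELERATOR'] = 'sdaa'   # set before importing deepspeed

from deepspeed.accelerator import get_accelerator

# If the backend is registered and its runtime is installed, this prints 'sdaa'.
print(get_accelerator().device_name())
```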

@delock
Collaborator

delock commented Feb 12, 2025

@delock, this CI failure occurs sporadically but seems unrelated to the PRs. Can you help investigate? Thanks!

https://github.com/deepspeedai/DeepSpeed/actions/runs/13260306651/job/37017257486?pr=6903

@tjruwase We are updating the firmware of this server to see whether the random failure goes away; please ignore this error for now. Thanks!

@siqi654321
Author

@tjruwase The tests "xpu-max1100" and "nv-torch-latest-v100" appear to run on the "xpu" and "cuda" accelerators respectively. However, I have confirmed that my modifications do not impact these two accelerators, which is quite puzzling. Is it possible to run the tests on a different CI machine, or is there anything else I can do to help troubleshoot the issue?
