Tecorigin sdaa accelerator #6903
Conversation
@siqi654321, is this ready for review?
@tjruwase Yes, it's ready for review. Tecorigin SDAA is an AI processor that supports AI frameworks such as PyTorch. It's possible to run Transformers/Accelerate/DeepSpeed on SDAA to train foundation models. Website: http://www.tecorigin.com/
@siqi654321, please address the DCO requirement.
614eea9 to 82ca0ea
Signed-off-by: siqi <[email protected]>
Signed-off-by: siqi <[email protected]>
82ca0ea to 38e4d50
@tjruwase The DCO problem has been solved.
@siqi654321, please consider doing the following as appropriate:
@loadams, did I miss anything?
@siqi654321, the following can help fix the formatting issue
Signed-off-by: siqi <[email protected]>
@tjruwase Thanks for the help. It seems the formatting error has been fixed. I would also like to ask whether the additional documentation you mentioned is necessary. I see that some accelerators, such as mlu, do not provide these documents.
Signed-off-by: siqi <[email protected]>
@siqi654321, the additional documentation is completely optional. This PR will merge once CI completes.
@tjruwase I found that the xpu-max1100 test has failed, but it doesn't seem to be related to my changes. Could you help take a look at this issue?
@delock, this CI failure occurs sporadically but seems unrelated to the PRs. Can you help investigate? Thanks! https://github.com/deepspeedai/DeepSpeed/actions/runs/13260306651/job/37017257486?pr=6903
I think this covers everything, the README and the accelerator setup guide being the most important.
@tjruwase We are updating the firmware on this server to see whether the random failure goes away. Please ignore this error for now. Thanks!
@tjruwase The tests "xpu-max1100" and "nv-torch-latest-v100" appear to run on the "xpu" and "cuda" accelerators respectively. However, I have confirmed that my modifications do not affect these two accelerators, so the failures are puzzling. Is it possible to run the tests on a different CI machine? Or is there anything else I can do to help troubleshoot the issue?
Description
This PR adds Tecorigin SDAA accelerator support.
With this PR, DeepSpeed supports SDAA as a backend for training tasks.
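To check which backend DeepSpeed actually picks up on a given machine, something like the following sketch may help. It relies on DeepSpeed's accelerator abstraction (`get_accelerator()` and the `DS_ACCELERATOR` environment variable). The accelerator name `"sdaa"` and the helper `describe_accelerator` are assumptions for illustration, not something stated in this thread, and the sketch degrades to a plain report when DeepSpeed or the vendor runtime is not installed.

```python
import importlib.util
import os


def describe_accelerator(requested=None):
    """Report which accelerator DeepSpeed selects in this environment.

    `requested` is an optional accelerator name that is written to the
    DS_ACCELERATOR environment variable before DeepSpeed is queried.
    Passing "sdaa" is an assumption about the name this PR registers.
    """
    if requested is not None:
        # Side effect: overrides DeepSpeed's automatic accelerator detection.
        os.environ["DS_ACCELERATOR"] = requested

    summary = {"requested": requested, "available": False}

    # Bail out early with a readable message if DeepSpeed is not installed.
    if importlib.util.find_spec("deepspeed") is None:
        summary["error"] = "deepspeed is not installed"
        return summary

    try:
        from deepspeed.accelerator import get_accelerator

        accelerator = get_accelerator()
        summary["device_name"] = accelerator.device_name()
        summary["communication_backend"] = accelerator.communication_backend_name()
        summary["available"] = True
    except Exception as exc:
        # For example, the requested accelerator's runtime package is missing.
        summary["error"] = f"{type(exc).__name__}: {exc}"
    return summary


if __name__ == "__main__":
    # Honors an existing DS_ACCELERATOR value, otherwise uses auto-detection.
    print(describe_accelerator(os.environ.get("DS_ACCELERATOR")))
```

Running the script with `DS_ACCELERATOR=sdaa` set (again, an assumed value) would be one way to confirm that the SDAA backend is selected rather than a fallback.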