
Tecorigin sdaa accelerator #6903

Open
siqi654321 wants to merge 7 commits into master from Tecorigin-SDAA-accelerator

Conversation

siqi654321

Description
This PR adds Tecorigin SDAA accelerator support. With this PR, DeepSpeed supports SDAA as a backend for training tasks.
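
For context, a DeepSpeed accelerator backend implements the DeepSpeedAccelerator abstraction in deepspeed/accelerator/abstract_accelerator.py. Below is a minimal illustrative sketch of that shape only, not this PR's actual code: the real interface requires many more methods, and the SDAA-specific names used here (device string, communication backend) are assumptions.

```python
# Illustrative sketch only -- NOT this PR's actual code.
# A real backend subclasses DeepSpeedAccelerator from
# deepspeed.accelerator.abstract_accelerator and implements its full
# method set; only a few representative methods are shown here.

class SDAA_Accelerator:
    def __init__(self):
        self._name = 'sdaa'
        # Assumption: the vendor ships a torch extension and a collective
        # communication library; 'tccl' is a placeholder name.
        self._communication_backend_name = 'tccl'

    def device_name(self, device_index=None):
        # Device strings follow the usual torch convention, e.g. 'sdaa:0'.
        if device_index is None:
            return 'sdaa'
        return 'sdaa:{}'.format(device_index)

    def communication_backend_name(self):
        return self._communication_backend_name

    def is_available(self):
        # A real implementation would query the vendor torch extension
        # (e.g. a hypothetical torch_sdaa); hard-coded for this sketch.
        return False
```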

@tjruwase
Contributor

@siqi654321, is this ready for review?

@siqi654321
Author

@tjruwase Yes, it's ready for review. Tecorigin SDAA is an AI processor that supports AI frameworks such as PyTorch. It's possible to run Transformers/Accelerate/DeepSpeed on SDAA to train foundation models. Website: http://www.tecorigin.com/
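
As background on how user code stays device-agnostic: DeepSpeed scripts go through get_accelerator() rather than hard-coding torch.cuda, so the same code can target CUDA, XPU, or, with this PR, SDAA. A minimal sketch, assuming a DeepSpeed install where the backend is detected:

```python
import torch
from deepspeed.accelerator import get_accelerator

acc = get_accelerator()             # resolves to the detected backend
device = acc.device_name()          # e.g. 'cuda', 'xpu', or 'sdaa'
x = torch.ones(2, 2).to(device)     # tensors move via the abstract device name
print(acc.device_count(), acc.communication_backend_name())
```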

@tjruwase
Contributor

tjruwase commented Feb 6, 2025

@siqi654321, please address the DCO requirement

siqi654321 force-pushed the Tecorigin-SDAA-accelerator branch from 82ca0ea to 38e4d50 on February 7, 2025 02:55
@siqi654321
Author

@tjruwase The DCO issue has been resolved.

@tjruwase
Contributor

tjruwase commented Feb 7, 2025

@siqi654321, the following can help fix the Formatting issue
https://github.com/deepspeedai/DeepSpeed/blob/master/CONTRIBUTING.md#prerequisites

Signed-off-by: siqi <[email protected]>
@siqi654321
Author

@tjruwase Thanks for the help. It seems that the formatting error has been fixed. I would also like to ask whether the additional documentation you mentioned is necessary; I see that some accelerators, such as mlu, do not provide these documents.

@tjruwase
Contributor

tjruwase commented Feb 8, 2025

@tjruwase Thanks for the help. It seems that the formatting error has been fixed. I would also like to ask whether the additional documentation you mentioned is necessary.

@siqi654321, the additional documentation is completely optional. This PR will merge once CI completes.

@siqi654321
Author

@tjruwase I found that the xpu-max1100 test has failed, but it doesn't seem to be related to my changes. Could you help take a look at this issue?

@tjruwase
Contributor

tjruwase commented Feb 11, 2025

@delock, this CI failure occurs sporadically but seems unrelated to the PRs. Can you help investigate? Thanks!

https://github.com/deepspeedai/DeepSpeed/actions/runs/13260306651/job/37017257486?pr=6903

@loadams
Collaborator

loadams commented Feb 12, 2025

@siqi654321, please consider doing the following as appropriate:

  1. Reviewing https://www.deepspeed.ai/tutorials/accelerator-abstraction-interface/
  2. Updating https://www.deepspeed.ai/tutorials/accelerator-setup-guide/
  3. Updating https://github.com/deepspeedai/DeepSpeed?tab=readme-ov-file#contributed-hw-support
  4. Updating https://github.com/deepspeedai/DeepSpeed?tab=readme-ov-file#build-pipeline-status

@loadams, did I miss anything?

I think this covers everything, the README and the accelerator setup guide being the most important.
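
Regarding item 1 in the checklist above: the accelerator abstraction also honors the DS_ACCELERATOR environment variable to force a specific backend, which is a quick way to verify that a newly added accelerator is registered. A sketch, assuming this PR registers the name 'sdaa' (the variable must be set before deepspeed is imported):

```python
import os
os.environ['DS_ACCELERATOR'] = 'sdaa'   # set before importing deepspeed

from deepspeed.accelerator import get_accelerator

# If the backend is registered and its runtime is installed, this prints 'sdaa'.
print(get_accelerator().device_name())
```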

@delock
Collaborator

delock commented Feb 12, 2025

@delock, this CI failure occurs sporadically but seems unrelated to the PRs. Can you help investigate? Thanks!

https://github.com/deepspeedai/DeepSpeed/actions/runs/13260306651/job/37017257486?pr=6903

@tjruwase We are updating the firmware of this server to see whether the random failure goes away; please ignore this error for now. Thanks!

@siqi654321
Author

@tjruwase The tests "xpu-max1100" and "nv-torch-latest-v100" appear to run on the "xpu" and "cuda" accelerators respectively. However, I have confirmed that my modifications do not impact these two accelerators, which is quite puzzling. Is it possible to run the tests on a different CI machine, or is there anything else I can do to help troubleshoot the issue?
