microsoft / DeepSpeed Public

Issues: microsoft/DeepSpeed

[Roadmap] DeepSpeed Roadmap Q1 2025

#6946 opened Jan 13, 2025 by loadams

Open

989 Open 1,952 Closed

Issues list

[BUG] Invalidate trace cache warning

Labels: bug, training

#6985 opened Jan 30, 2025 by leachim

[BUG] pdsh runner doesn't work with tqdm bar

Labels: bug, training

#6978 opened Jan 29, 2025 by Superskyyy

[BUG] Errors in GPT-MoE models Inferences

Labels: bug, inference

#6973 opened Jan 25, 2025 by 1155157110

[BUG] libaio on amd node

Labels: bug, training

#6972 opened Jan 25, 2025 by GuanhuaWang

GPUUtil-0 remains 0 during the process loading a 72B model

#6970 opened Jan 24, 2025 by NivinaNull

[BUG] the input variables may be changed to scalars when use activation checkpoint

Labels: bug, training

#6969 opened Jan 23, 2025 by zhangvia

[BUG] z3+compile+gradient checkpoint uses more memory

Labels: bug, training

#6966 opened Jan 22, 2025 by oraluben

[BUG] deepspeed fails with torch 2.5 due to module._parameters is a dict, no longer a OrderedDict

Labels: bug, training

#6961 opened Jan 20, 2025 by skydoorkai

Is "Hierarchical All-to-all" feat available in current version?

#6957 opened Jan 16, 2025 by GalanPei

[REQUEST] FPDT backward test

Labels: enhancement

#6955 opened Jan 16, 2025 by YizhouZ

[REQUEST] Pipeline Parallelism support multi optimizer to train

Labels: enhancement

#6951 opened Jan 15, 2025 by whcjb

[BUG] model(**input) cannot use under zero stage 3.

Labels: bug, training

#6949 opened Jan 14, 2025 by MarkDeng1

[Roadmap] DeepSpeed Roadmap Q1 2025

Labels: roadmap

#6946 opened Jan 13, 2025 by loadams

5 tasks

nv-nightly CI test failure

Labels: ci-failure

#6935 opened Jan 9, 2025 by github-actions bot

[BUG] deepspeed.initialize changes the output of Llama model

Labels: bug, training

#6929 opened Jan 7, 2025 by Ktakuya332C

Multi node multi gpu distributed load

Labels: enhancement

#6927 opened Jan 6, 2025 by rastinrastinii

[BUG]Zero++ training failed

Labels: bug, training

#6926 opened Jan 6, 2025 by HelloWorld506

[REQUEST] Deepspeed Inference Supports VL (vision language) model

Labels: enhancement

#6917 opened Dec 26, 2024 by ethen8181

[BUG] Cannot access local variable 'locations' where it is not associated with a value

Labels: bug, training

#6913 opened Dec 25, 2024 by Guodanding

[BUG]Convergence Issue: Training BERT for Embedding with Zero2 and 3 as compared to Torchrun

Labels: bug, training

#6911 opened Dec 24, 2024 by dawnik17

[BUG] RuntimeError: The size of tensor a (2048) must match the size of tensor b (1024) at non-singleton dimension 2

Labels: bug, deepspeed-chat

#6910 opened Dec 24, 2024 by Lowlowlowlowlowlow

[REQUEST] Support for XLA/TPU

Labels: enhancement

#6901 opened Dec 21, 2024 by radna0

prterun noticed that process rank 7 with PID 0 on node gpu0304 exited on signal 6 (Aborted).

#6896 opened Dec 19, 2024 by fabiogeraci

How to perform inference MoE model with expert parallel

#6891 opened Dec 18, 2024 by Guodanding

Using zero3 on multiple nodes is slow

Labels: bug, training

#6889 opened Dec 18, 2024 by HelloWorld506
