
For model-parallel&multi-gpu training and inference #448

Open
hiun opened this issue Sep 27, 2020 · 0 comments

hiun commented Sep 27, 2020

Hi,

  1. I am looking for materials on how GPU memory in DeepSpeed is used in a model-parallel & multi-GPU training setting (i.e., where all the weights do not fit into a single GPU's memory even with DeepSpeed applied).

To my current understanding, the following post only visualizes the data-parallel & multi-GPU setting: https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/. Any reference materials that would help me understand the internals are appreciated (either DeepSpeed docs or external docs).
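For context, the memory arithmetic behind that blog post's figures can be sketched in plain Python. This is only an illustration under stated assumptions: mixed-precision Adam costing 2 + 2 + 12 = 16 bytes per parameter (fp16 params, fp16 grads, fp32 optimizer states) as described in the ZeRO paper, and the paper's three cumulative partitioning stages; the function name is mine, not a DeepSpeed API.

```python
def zero_memory_per_gpu_gb(num_params, num_gpus, stage):
    """Rough per-GPU memory (GB) for model states under ZeRO partitioning.

    Assumes mixed-precision Adam: 2 bytes/param for fp16 weights,
    2 bytes/param for fp16 gradients, and 12 bytes/param for optimizer
    states (fp32 master weights, momentum, variance), per the ZeRO paper.
    Ignores activations and temporary buffers.
    """
    params = 2 * num_params
    grads = 2 * num_params
    optim = 12 * num_params
    if stage >= 1:          # stage 1: partition optimizer states
        optim /= num_gpus
    if stage >= 2:          # stage 2: also partition gradients
        grads /= num_gpus
    if stage >= 3:          # stage 3: also partition parameters
        params /= num_gpus
    return (params + grads + optim) / 1e9

# A 7.5B-parameter model on 64 GPUs, as in the blog post's example:
print(zero_memory_per_gpu_gb(7.5e9, 64, 0))  # 120.0 GB per GPU, no partitioning
print(zero_memory_per_gpu_gb(7.5e9, 64, 3))  # 1.875 GB per GPU, fully partitioned
```

Even at full partitioning, each GPU still needs the activations and working buffers on top of this, which is why the per-GPU numbers in practice are higher than the model-state estimate alone.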


  2. Does DeepSpeed provide an inference or serving API for a model-parallel & multi-GPU environment? (Since the weights are partitioned in a model-parallel setting, I think serving is non-trivial. For now, however, I am looking at reusing the typical inference function from the training process.)
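To make the second question concrete, here is a toy, pure-Python sketch (no real GPUs or DeepSpeed calls; `matmul` and the shard names are illustrative) of why partitioned weights complicate serving: each device holds only a slice of the weight matrix, so a single forward pass needs a communication step to assemble the full output.

```python
def matmul(x, w):
    """Multiply a 1 x k vector by a k x n matrix, returning a 1 x n vector."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]

# A full 2 x 4 weight matrix, split column-wise across two "devices".
w_full = [[1.0, 2.0, 3.0, 4.0],
          [5.0, 6.0, 7.0, 8.0]]
shard0 = [row[:2] for row in w_full]   # columns 0-1 live on "GPU 0"
shard1 = [row[2:] for row in w_full]   # columns 2-3 live on "GPU 1"

x = [1.0, 1.0]
# Each device computes only its partial output; a serving layer must
# gather (here: concatenate) the partials to recover the full result.
y = matmul(x, shard0) + matmul(x, shard1)
assert y == matmul(x, w_full)
print(y)  # [6.0, 8.0, 10.0, 12.0]
```

In a data-parallel setting every replica holds the full weights, so any one replica can serve requests alone; with model parallelism the gather step above has to happen on every request, which is what makes a serving API non-trivial.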

Thank you.
