Issues: NVIDIA/Megatron-LM
[QUESTION] Backend nccl does not support reduce_scatter_tensor_coalesced; how can I solve this?
#1369, opened Jan 30, 2025 by TeddLi
[BUG] BERT and GPT345 Model Checkpoints Returning 410 Gone HTTP Response
#1367, opened Jan 28, 2025 by GangGreenTemperTatum
[QUESTION] The dataset cannot be found in multi-node multi-GPU training.
#1355, opened Jan 13, 2025 by stay88
[BUG] When trying to convert llama2-7b model from HF format to megatron format
#1348, opened Jan 6, 2025 by Sun2018421
[QUESTION] How to convert the weight file format of the MAMBA model from pt to safetensors format?
#1339, opened Dec 26, 2024 by fxnie
[QUESTION] How can I load a checkpoint trained by Megatron-LM 0.5 into Megatron-LM 0.7 to resume pretraining?
#1333, opened Dec 22, 2024 by IgorZan
[BUG] MoE load balancing loss is accumulated twice when using activation checkpointing
#1330, opened Dec 20, 2024 by thuwzt
[BUG] Megatron-LM with torch.compile: "The provided qkv memory layout is not supported!"
#1329, opened Dec 20, 2024 by qingshanxwx
[QUESTION] Why doesn't GPTDataset build a global shuffle index?
#1328, opened Dec 20, 2024 by dynamicheart
[BUG] Precision issue caused by different token dispatchers in MoE training
#1327, opened Dec 17, 2024 by qi7kuo
ProTip! Exclude everything labeled bug with -label:bug.
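For example, entering the search query is:issue is:open -label:bug in the issues search box lists the open issues that do not carry the bug label; is:issue, is:open, and -label: are standard GitHub search qualifiers.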