Issues: NVIDIA/Megatron-LM
[QUESTION] Backend nccl does not support reduce_scatter_tensor_coalesced; how can I solve this?
#1369, opened Jan 30, 2025 by TeddLi
[BUG] BERT and GPT345 Model Checkpoints Returning 410 Gone HTTP Response
#1367, opened Jan 28, 2025 by GangGreenTemperTatum
[QUESTION] The dataset cannot be found in multi-node multi-GPU training.
#1355, opened Jan 13, 2025 by stay88
[BUG] When trying to convert llama2-7b model from HF format to megatron format
#1348, opened Jan 6, 2025 by Sun2018421
[QUESTION] How to convert the weight file format of the MAMBA model from pt to safetensors format?
#1339, opened Dec 26, 2024 by fxnie
[QUESTION] How can I load a checkpoint trained by Megatron-LM 0.5 into Megatron-LM 0.7 to resume pretraining?
#1333, opened Dec 22, 2024 by IgorZan
[BUG] MoE load balancing loss is accumulated twice when using activation checkpointing
#1330, opened Dec 20, 2024 by thuwzt
[BUG] Megatron-LM with torch.compile: "The provided qkv memory layout is not supported!"
#1329, opened Dec 20, 2024 by qingshanxwx
[QUESTION] Why doesn't GPTDataset build a global shuffle index?
#1328, opened Dec 20, 2024 by dynamicheart
[BUG] Precision issue caused by different token dispatchers in MoE training
#1327, opened Dec 17, 2024 by qi7kuo
ProTip! Exclude everything labeled bug with -label:bug.
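For example, entering the search query is:issue is:open -label:bug in the issues search box lists the open issues that do not carry the bug label; is:issue, is:open, and -label: are standard GitHub search qualifiers.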