Some Mixture of Experts implementations :)
The idea of this repo is just to have simple implementations of MoEs in one place, both as an overview and for easy access. We focus on MoEs for large language models (here, medium-sized GPT-2-style models), where they replace the standard feedforward layers in the transformer blocks. Plug and play with them inside our modular llm-baselines codebase, which extends nanoGPT with support for different datasets!
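To make the setup concrete, here is a minimal PyTorch sketch of classical top-k token-choice routing, i.e. an MoE block that can stand in for the transformer's feedforward layer. The names (`Expert`, `MoE`, `n_experts`, `top_k`) and the naive per-expert dispatch loop are illustrative only, not necessarily the exact API used in `moe.py`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """One expert: a standard GPT-2 style feedforward network."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

class MoE(nn.Module):
    """Stand-in for the transformer FFN: each token is routed to its top-k experts."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # classical linear gating
        self.top_k = top_k

    def forward(self, x):                                   # x: (batch, seq, d_model)
        scores = F.softmax(self.gate(x), dim=-1)            # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)      # keep the top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        # naive dispatch loop for clarity; real implementations batch this per expert
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                     # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

In a GPT-2 style block, such a module would simply be called in the residual branch where the MLP normally sits.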
For a broader overview of MoEs in the LLM context, see our shared doc.
Currently implemented:
- Classical linear gating with softmax + top-k
- Expert choice routing (paper, see the sketch below)
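As a sketch of the second scheme: here each expert picks its top-`capacity` tokens (instead of each token picking its top experts), which fixes the per-expert load by construction. The class name and the `capacity_factor` default are assumptions for illustration, not the repo's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertChoiceMoE(nn.Module):
    """Expert choice routing: each expert selects its top-`capacity` tokens."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, capacity_factor: float = 2.0):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.n_experts = n_experts
        self.capacity_factor = capacity_factor

    def forward(self, x):                                    # x: (batch, seq, d_model)
        B, T, D = x.shape
        tokens = x.reshape(B * T, D)
        scores = F.softmax(self.gate(tokens), dim=-1)        # (B*T, n_experts) affinities
        # each expert processes at most `capacity` tokens
        capacity = min(B * T, max(1, int(self.capacity_factor * B * T / self.n_experts)))
        weights, idx = scores.t().topk(capacity, dim=-1)     # (n_experts, capacity)
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            chosen = idx[e]                                  # indices of the tokens expert e picked
            out[chosen] += weights[e].unsqueeze(-1) * expert(tokens[chosen])
        return out.reshape(B, T, D)
```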
We have preliminary results for small-scale pretraining (~65M-250M parameters, Mixtral-style MoE) on different datasets: the MoE shows a performance improvement similar to a dense model of double the depth (and double the parameter count), all while keeping the FLOPs close to those of the base dense model (with top-2 routing).
The files are the following:
- `gpt.py`: the standard transformer base architecture (GPT-2 style, similar to nanoGPT)
- `moe.py`: the mixture-of-experts block
- `aux_losses.py`: the typical load-balancing losses used for MoEs (one common variant is sketched below)
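As an illustration of what such a loss looks like, here is a sketch of the Switch-Transformer-style load-balancing term, which is minimized when both the fraction of tokens dispatched to each expert and the mean router probability per expert are uniform. The exact losses in `aux_losses.py` may differ in form and naming:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, expert_idx: torch.Tensor, n_experts: int):
    """router_logits: (n_tokens, n_experts); expert_idx: (n_tokens, top_k) chosen expert ids."""
    probs = F.softmax(router_logits, dim=-1)
    # f_e: fraction of tokens dispatched to each expert (counting every top-k slot)
    dispatch = F.one_hot(expert_idx, n_experts).float()      # (n_tokens, top_k, n_experts)
    tokens_per_expert = dispatch.sum(dim=(0, 1)) / dispatch.sum()
    # P_e: mean router probability assigned to each expert
    prob_per_expert = probs.mean(dim=0)
    # minimized when both distributions are uniform over the experts
    return n_experts * torch.sum(tokens_per_expert * prob_per_expert)
```

This term is typically added to the language-modeling loss with a small coefficient (e.g. 0.01 in the Switch Transformer paper).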
If you are interested in this effort, please reach out to us on the Swiss AI Slack :)
Alex Hägele ([email protected]), Martin Jaggi ([email protected]).