Some Mixture of Experts implementations :)
The idea of this repo is just to have simple implementations of MoEs in one place, both as an overview and for easy access. We focus on MoEs for large language models (here, medium-sized GPT-2-style models), where they replace the standard feedforward layers in the transformer blocks. Plug and play with them inside our modular llm-baselines codebase, which extends nanoGPT with support for different datasets!
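To make the setup concrete, here is a minimal PyTorch sketch of classical top-k token-choice routing, i.e. an MoE block that can stand in for the transformer's feedforward layer. The names (`Expert`, `MoE`, `n_experts`, `top_k`) and the naive per-expert dispatch loop are illustrative only, not necessarily the exact API used in `moe.py`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """One expert: a standard GPT-2 style feedforward network."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

class MoE(nn.Module):
    """Stand-in for the transformer FFN: each token is routed to its top-k experts."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # classical linear gating
        self.top_k = top_k

    def forward(self, x):                                   # x: (batch, seq, d_model)
        scores = F.softmax(self.gate(x), dim=-1)            # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)      # keep the top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        # naive dispatch loop for clarity; real implementations batch this per expert
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                     # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

In a GPT-2 style block, such a module would simply be called in the residual branch where the MLP normally sits.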
For a broader overview of MoEs in the LLM context, see our shared doc.
Currently implemented:
- Classical linear gating with softmax + top-k
- Expert choice routing (paper, see the sketch below)
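As a sketch of the second scheme: here each expert picks its top-`capacity` tokens (instead of each token picking its top experts), which fixes the per-expert load by construction. The class name and the `capacity_factor` default are assumptions for illustration, not the repo's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertChoiceMoE(nn.Module):
    """Expert choice routing: each expert selects its top-`capacity` tokens."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, capacity_factor: float = 2.0):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.n_experts = n_experts
        self.capacity_factor = capacity_factor

    def forward(self, x):                                    # x: (batch, seq, d_model)
        B, T, D = x.shape
        tokens = x.reshape(B * T, D)
        scores = F.softmax(self.gate(tokens), dim=-1)        # (B*T, n_experts) affinities
        # each expert processes at most `capacity` tokens
        capacity = min(B * T, max(1, int(self.capacity_factor * B * T / self.n_experts)))
        weights, idx = scores.t().topk(capacity, dim=-1)     # (n_experts, capacity)
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            chosen = idx[e]                                  # indices of the tokens expert e picked
            out[chosen] += weights[e].unsqueeze(-1) * expert(tokens[chosen])
        return out.reshape(B, T, D)
```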
We have preliminary results for small-scale pretraining (~65M-250M parameters, Mixtral-style MoE) on different datasets: the MoE shows a performance improvement similar to a dense model of double the depth (and double the parameter count), all while keeping the FLOPs close to those of the base dense model (with top-2 routing).
The files are the following:
- `gpt.py`: the standard transformer base architecture (GPT-2 style, similar to nanoGPT)
- `moe.py`: the mixture-of-experts block
- `aux_losses.py`: the typical load-balancing losses used for MoEs (one common variant is sketched below)
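As an illustration of what such a loss looks like, here is a sketch of the Switch-Transformer-style load-balancing term, which is minimized when both the fraction of tokens dispatched to each expert and the mean router probability per expert are uniform. The exact losses in `aux_losses.py` may differ in form and naming:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, expert_idx: torch.Tensor, n_experts: int):
    """router_logits: (n_tokens, n_experts); expert_idx: (n_tokens, top_k) chosen expert ids."""
    probs = F.softmax(router_logits, dim=-1)
    # f_e: fraction of tokens dispatched to each expert (counting every top-k slot)
    dispatch = F.one_hot(expert_idx, n_experts).float()      # (n_tokens, top_k, n_experts)
    tokens_per_expert = dispatch.sum(dim=(0, 1)) / dispatch.sum()
    # P_e: mean router probability assigned to each expert
    prob_per_expert = probs.mean(dim=0)
    # minimized when both distributions are uniform over the experts
    return n_experts * torch.sum(tokens_per_expert * prob_per_expert)
```

This term is typically added to the language-modeling loss with a small coefficient (e.g. 0.01 in the Switch Transformer paper).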
If you are interested in this effort, please reach out to us on the Swiss AI Slack :)
Alex Hägele ([email protected]), Martin Jaggi ([email protected]).