MoE

Some Mixture of Experts implementations :)

Overview

The idea of this repo is to collect simple implementations of MoEs in one place, both as an overview and for easy access. We focus on MoEs for large language models (here, medium-sized GPT-2-style models), where they replace the standard feedforward layers in the transformer blocks. The code is plug-and-play with our modular llm-baselines codebase, which extends nanoGPT with different datasets!
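
To make the "replace the feedforward layer" idea concrete, here is a minimal sketch of an MoE layer that can stand in for the MLP inside a transformer block. Names and signatures (`SimpleMoE`, `n_experts`, `k`) are illustrative placeholders and do not reflect the actual API of moe.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    """One expert: the same MLP a dense GPT-2 block would use."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x):
        return self.net(x)

class SimpleMoE(nn.Module):
    """Drop-in replacement for the dense FFN: a linear gate scores the experts,
    each token is routed to its top-k experts, and the expert outputs are
    combined with the renormalized gate weights."""
    def __init__(self, d_model, d_hidden, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(FeedForward(d_model, d_hidden) for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, x):                                    # x: (batch, seq, d_model)
        probs = F.softmax(self.gate(x), dim=-1)              # (B, T, n_experts)
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)    # (B, T, k)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        # Naive "dense" dispatch: every expert sees every token and non-routed
        # tokens get weight 0. Clear to read, but not how an efficient MoE runs.
        for e, expert in enumerate(self.experts):
            weight = (topk_probs * (topk_idx == e)).sum(dim=-1, keepdim=True)  # (B, T, 1)
            out = out + weight * expert(x)
        return out
```

In a full model, the MLP of every transformer block is simply swapped for such a layer; efficient implementations gather the tokens assigned to each expert instead of running every expert over the whole batch as done here.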

For a broader overview of MoEs in the LLM context, see our shared doc.

Currently implemented:

  • Classical linear gating with softmax + top-k (see the routing sketch below)
  • Expert choice routing (paper)
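
The two schemes differ mainly in who does the choosing: with classical top-k gating each token selects its experts, whereas with expert choice routing each expert selects a fixed number of tokens. Below is a rough sketch of the two selection steps; function names and shapes are illustrative and not the repo's API:

```python
import torch
import torch.nn.functional as F

def topk_token_choice(router_logits, k):
    """Classical gating: softmax over experts, then each token keeps its top-k.
    router_logits: (num_tokens, num_experts)."""
    probs = F.softmax(router_logits, dim=-1)
    weights, expert_idx = probs.topk(k, dim=-1)             # (num_tokens, k)
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over the chosen k
    return weights, expert_idx

def expert_choice(router_logits, capacity):
    """Expert choice routing: each expert selects its `capacity` highest-scoring
    tokens, so expert load is balanced by construction (though some tokens may
    end up selected by no expert)."""
    probs = F.softmax(router_logits, dim=-1)                # (num_tokens, num_experts)
    weights, token_idx = probs.t().topk(capacity, dim=-1)   # (num_experts, capacity)
    return weights, token_idx

# Example: 16 tokens routed to 4 experts.
logits = torch.randn(16, 4)
w_tok, e_idx = topk_token_choice(logits, k=2)        # tokens choose experts
w_exp, t_idx = expert_choice(logits, capacity=8)     # experts choose tokens
```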

We have preliminary results on small-model pretraining (~65M-250M params, Mixtral-style MoE) on different datasets that show a performance improvement similar to a double-depth (double-parameter) dense model, while keeping the FLOPs close to the base dense model (top-2 routing).

Files

The repository contains the following files:

gpt.py        # the standard transformer base architecture (GPT-2 style, similar to nanoGPT)
moe.py        # the mixture-of-experts block
aux_losses.py # the typical load-balancing losses used for MoEs
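
As a reference for what aux_losses.py covers, here is a sketch of the Switch-Transformer-style load-balancing loss, one of the typical auxiliary losses for MoEs (the exact losses and signatures in the repo may differ):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_idx, num_experts):
    """Switch-Transformer-style auxiliary loss: num_experts * sum_e f_e * P_e,
    where f_e is the fraction of tokens dispatched to expert e and P_e is the
    mean router probability assigned to expert e. It is minimized when routing
    is uniform across experts.
    router_logits: (num_tokens, num_experts)
    expert_idx:    (num_tokens,) or (num_tokens, k) chosen expert indices."""
    probs = F.softmax(router_logits, dim=-1)                        # (T, E)
    counts = torch.bincount(expert_idx.flatten(), minlength=num_experts).float()
    frac_tokens = counts / counts.sum()                             # f_e, sums to 1
    mean_probs = probs.mean(dim=0)                                  # P_e
    return num_experts * torch.sum(frac_tokens * mean_probs)
```

The auxiliary loss is usually scaled by a small coefficient (e.g., 0.01 in the Switch Transformer paper) and added to the language-modeling loss.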

Contact

If you are interested in this effort, please reach out to us on the Swiss AI Slack :)

Alex Hägele ([email protected]), Martin Jaggi ([email protected]).
