Given that MegaBlocks is highly optimized for sparse MoE models like Mixtral, I am requesting support for a variant recently termed MoDE (Mixture-of-Depths-and-Experts) by Google DeepMind. Benefits include much faster training and inference, since routing tokens around entire blocks adds a further layer of sparsity on top of expert sparsity.
Paper: https://arxiv.org/abs/2404.02258
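For context, here is a minimal sketch of the Mixture-of-Depths routing mechanism the paper builds on: a learned router scores every token, only the top-k tokens per sequence pass through the block, and the rest skip it via the residual path. The class, parameter names, and capacity ratio below are illustrative assumptions, not MegaBlocks API or the paper's exact implementation:

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Wraps a transformer block with Mixture-of-Depths top-k token routing.

    `block` is any (B, T, D) -> (B, T, D) module; `capacity_ratio` is the
    fraction of tokens that actually run through it (hypothetical default).
    """
    def __init__(self, block: nn.Module, d_model: int, capacity_ratio: float = 0.125):
        super().__init__()
        self.block = block
        self.router = nn.Linear(d_model, 1)   # scalar routing weight per token
        self.capacity_ratio = capacity_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        k = max(1, int(T * self.capacity_ratio))
        scores = self.router(x).squeeze(-1)            # (B, T) router logits
        topk = scores.topk(k, dim=-1).indices          # tokens that take the block
        idx = topk.unsqueeze(-1).expand(-1, -1, D)     # (B, k, D) gather index
        selected = x.gather(1, idx)
        # Scale the block output by the router weight so routing stays
        # differentiable for the selected tokens.
        gate = torch.sigmoid(scores.gather(1, topk)).unsqueeze(-1)
        processed = self.block(selected) * gate
        # Selected tokens get a residual update; all others skip the block.
        return x.scatter(1, idx, selected + processed)
```

The compute saving comes from `k << T`: the wrapped block only ever sees `k` tokens per sequence, which is what makes the MoDE combination attractive on top of MegaBlocks' existing expert sparsity.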
I found two implementations: