Does the router implement the noisy top-k routing suggested in the "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" paper?

In the router code you seem to apply the noise at the input of the router, not to the router scores as in the paper above:
```python
def forward(self, x):
    if self.training and self.args.moe_jitter_eps is not None:
        x = x * self.jitter(x)

    scores = self.layer(x.view(-1, x.shape[-1])).softmax(dim=-1)
    expert_weights, expert_indices = self._top_k(scores)
    if self.args.moe_normalize_expert_weights:
        expert_weights = expert_weights / torch.norm(
            expert_weights,
            p=self.args.moe_normalize_expert_weights,
            dim=-1,
            keepdim=True)

    expert_indices = (
        _uniform_expert_assignment(expert_indices, self.args.moe_num_experts)
        if self.args.uniform_expert_assignment
        else expert_indices
    )
    return scores, expert_weights, expert_indices
```
In the aforementioned paper the noisy top-k gating works like:
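For reference, the noisy top-k gating defined in Shazeer et al. (2017) adds learned, per-expert Gaussian noise to the gating logits before selecting the top k:

$$H(x)_i = (x \cdot W_g)_i + \mathrm{StandardNormal}() \cdot \mathrm{Softplus}\big((x \cdot W_{\text{noise}})_i\big)$$

$$G(x) = \mathrm{Softmax}\big(\mathrm{KeepTopK}(H(x), k)\big)$$

where $\mathrm{KeepTopK}$ sets all but the $k$ largest components of $H(x)$ to $-\infty$, so the softmax is taken only over the selected experts.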
Is this something equivalent? I'm not trying to argue that it's wrong; I'm just trying to figure out whether it's the same.
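To make the difference concrete, here is a minimal sketch of the two noise placements side by side. All tensor shapes, weight names, and the jitter magnitude are hypothetical, standalone stand-ins, not the actual router internals:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_tokens, hidden, num_experts, k = 4, 8, 6, 2

x = torch.randn(num_tokens, hidden)
w_gate = torch.randn(hidden, num_experts)   # gating weights
w_noise = torch.randn(hidden, num_experts)  # learned noise scale (paper only)

# Variant A (as in the router code above): multiplicative jitter on the
# router *input*, then a plain softmax over the logits.
eps = 0.01
jitter = torch.empty_like(x).uniform_(1.0 - eps, 1.0 + eps)
scores_a = ((x * jitter) @ w_gate).softmax(dim=-1)

# Variant B (Shazeer et al.): additive Gaussian noise on the *logits*,
# scaled by a learned softplus term, with the softmax taken only over
# the top-k entries (everything else masked to -inf).
h = x @ w_gate + torch.randn(num_tokens, num_experts) * F.softplus(x @ w_noise)
topk_vals, topk_idx = h.topk(k, dim=-1)
masked = torch.full_like(h, float('-inf')).scatter(-1, topk_idx, topk_vals)
scores_b = masked.softmax(dim=-1)  # exactly k nonzero weights per token
```

Note that in Variant A every expert still gets a nonzero score before `_top_k` is applied, whereas in Variant B the softmax itself only sees the k selected logits, and the noise scale is a learned function of the input rather than a fixed jitter.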
@tgale96, what do you think, since you implemented this? It does seem different to me, but I'm not sure if it was pulled from some other paper.