Does the router implement the noisy top-k routing suggested in the "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" paper?

In the router code you seem to apply the noise at the input of the router, not to the router scores as in the paper above:
```python
def forward(self, x):
    if self.training and self.args.moe_jitter_eps is not None:
        x = x * self.jitter(x)

    scores = self.layer(x.view(-1, x.shape[-1])).softmax(dim=-1)
    expert_weights, expert_indices = self._top_k(scores)
    if self.args.moe_normalize_expert_weights:
        expert_weights = expert_weights / torch.norm(
            expert_weights,
            p=self.args.moe_normalize_expert_weights,
            dim=-1,
            keepdim=True)

    expert_indices = (
        _uniform_expert_assignment(expert_indices, self.args.moe_num_experts)
        if self.args.uniform_expert_assignment
        else expert_indices
    )
    return scores, expert_weights, expert_indices
```
In the aforementioned paper the noisy top-k gating works like:
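For reference, the noisy top-k gating defined in Shazeer et al. (2017) adds learned, per-expert Gaussian noise to the gating logits before selecting the top k:

$$H(x)_i = (x \cdot W_g)_i + \mathrm{StandardNormal}() \cdot \mathrm{Softplus}\big((x \cdot W_{\text{noise}})_i\big)$$

$$G(x) = \mathrm{Softmax}\big(\mathrm{KeepTopK}(H(x), k)\big)$$

where $\mathrm{KeepTopK}$ sets all but the $k$ largest components of $H(x)$ to $-\infty$, so the softmax is taken only over the selected experts.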
Is this something equivalent? I'm not trying to argue that it's wrong; I'm just trying to figure out whether it's the same.
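To make the difference concrete, here is a minimal sketch of the two noise placements side by side. All tensor shapes, weight names, and the jitter magnitude are hypothetical, standalone stand-ins, not the actual router internals:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_tokens, hidden, num_experts, k = 4, 8, 6, 2

x = torch.randn(num_tokens, hidden)
w_gate = torch.randn(hidden, num_experts)   # gating weights
w_noise = torch.randn(hidden, num_experts)  # learned noise scale (paper only)

# Variant A (as in the router code above): multiplicative jitter on the
# router *input*, then a plain softmax over the logits.
eps = 0.01
jitter = torch.empty_like(x).uniform_(1.0 - eps, 1.0 + eps)
scores_a = ((x * jitter) @ w_gate).softmax(dim=-1)

# Variant B (Shazeer et al.): additive Gaussian noise on the *logits*,
# scaled by a learned softplus term, with the softmax taken only over
# the top-k entries (everything else masked to -inf).
h = x @ w_gate + torch.randn(num_tokens, num_experts) * F.softplus(x @ w_noise)
topk_vals, topk_idx = h.topk(k, dim=-1)
masked = torch.full_like(h, float('-inf')).scatter(-1, topk_idx, topk_vals)
scores_b = masked.softmax(dim=-1)  # exactly k nonzero weights per token
```

Note that in Variant A every expert still gets a nonzero score before `_top_k` is applied, whereas in Variant B the softmax itself only sees the k selected logits, and the noise scale is a learned function of the input rather than a fixed jitter.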
@tgale96, what do you think, since you implemented this? It does seem different to me, but I'm not sure if it was pulled from some other paper.