Feature Request: Support Codestral Mamba #8519
Comments
I love the shout-out in the linked blog post!
That's a really nice nod -- love to see it!
#7727 should cover this model, but with untied embeddings, unlike the other Mamba-2 models.
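For readers unfamiliar with the term: "untied embeddings" just means the output head has its own weight matrix instead of reusing the token-embedding matrix, so the converter has to handle one extra tensor. A toy sketch of the difference (a minimal illustration, not the actual convert-script code; all names here are made up):

```python
import numpy as np

vocab, d_model = 32000, 4096
tok_embd = np.random.randn(vocab, d_model).astype(np.float32)

# Tied embeddings: the output projection reuses the embedding matrix,
# so the checkpoint stores a single (vocab, d_model) tensor.
def logits_tied(hidden):            # hidden: (d_model,)
    return tok_embd @ hidden        # (vocab,)

# Untied embeddings: a separate output matrix exists and must be
# converted and stored alongside the token embeddings.
output_w = np.random.randn(vocab, d_model).astype(np.float32)
def logits_untied(hidden):
    return output_w @ hidden
```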
FYI, there is an "ngroups" param that changes how the layer norm is done: https://github.com/state-spaces/mamba/blob/c0a00bd1808881831ddf43206c69362d4df90cf7/mamba_ssm/modules/mamba2.py#L47 We use ngroups=8. If you forget it or try with ngroups=1, you'll have a bad time. Good luck!
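For anyone wondering what ngroups changes in practice: in Mamba-2 the gated RMSNorm before the output projection is applied per group of channels rather than over the whole hidden dimension, so using the wrong group count silently produces wrong activations. A minimal NumPy sketch of the idea (simplified: it omits the SiLU gating that the real mamba_ssm norm applies; names are illustrative):

```python
import numpy as np

def grouped_rmsnorm(x, weight, ngroups, eps=1e-5):
    """RMS-normalize the last dimension of x in `ngroups` independent groups.

    With ngroups=1 this reduces to plain RMSNorm over the whole hidden
    dimension, which is why forgetting ngroups=8 gives bad outputs.
    """
    d = x.shape[-1]
    assert d % ngroups == 0
    xg = x.reshape(*x.shape[:-1], ngroups, d // ngroups)
    xg = xg / np.sqrt(np.mean(xg ** 2, axis=-1, keepdims=True) + eps)
    return xg.reshape(x.shape) * weight
```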
After we merge #8526 we should try to add full support for this model. cc @compilade
I'd love this.
thanks!
Hey guys, any progress or an ETA on this?
Some progress report: I have a local branch (not yet public) on top of #8526 in which I've started implementing the graph for Mamba-2. The conv step is very similar to Mamba-1, and I've started to implement the SSM step and will continue in the next days. It's not in a usable state yet.

I'm starting by implementing the fully recurrent mode of Mamba-2 (which is very similar to Mamba-1, and which is described in Section 3.4.1 of the paper), but I'm still evaluating how the block decomposition would fit in.

For the ETA, I'll try to get it working before the end of August, but no promises.

(And BTW @rmusser01, #8980 is waiting on #8526, not the other way around, at least I think?)
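For context on what the "fully recurrent mode" of Section 3.4.1 looks like, here is a rough single-head NumPy sketch of the per-timestep Mamba-2 update; shapes and names are illustrative, and this is not the actual llama.cpp graph code:

```python
import numpy as np

def mamba2_recurrent_step(h, x, dt, A, B, C, D):
    """One recurrent Mamba-2 step for a single head (illustrative shapes).

    h : (headdim, d_state)  previous SSM state
    x : (headdim,)          input for this head at this timestep
    dt: scalar              discretization step (after softplus + bias)
    A : scalar              per-head decay parameter (negative)
    B : (d_state,)          input projection for this timestep
    C : (d_state,)          output projection for this timestep
    D : scalar              skip connection
    """
    dA = np.exp(dt * A)               # scalar decay; in Mamba-1, A is a full (d_inner, d_state) matrix
    h = dA * h + np.outer(dt * x, B)  # decay the state, then add a rank-1 update from the input
    y = h @ C + D * x                 # read the state out through C and add the skip connection
    return h, y
```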
Okay, the fully recurrent mode works for Mamba-2.

Note that Mamba-Codestral-7B-v0.1 cannot be converted as-is; either use https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1/discussions/9, or rename […].

The state in Mamba-2 is bigger than I thought; Mamba-Codestral-7B-v0.1 takes […]. A big downside right now with recurrent models in llama.cpp is […].

The implementation is CPU-only, but uses SIMD for the SSM scan, so even though the state is bigger than for Mamba-1 models, in my tests, the speed of […]. The speed of Mamba-2 models seems comparable to Transformer-based models when the latter have 2k to 4k tokens in their context. Just making sure expectations are not too far from reality.
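To get a feel for why the Mamba-2 state is so much bigger than Mamba-1's, here is a back-of-the-envelope estimate. All hyperparameters below are assumptions about the Mamba-Codestral-7B-v0.1 config (double-check against its config.json); the point is only to show which dimensions dominate:

```python
# Rough per-sequence recurrent state size for a Mamba-2 model.
# Every hyperparameter here is ASSUMED, not taken from the issue.
def mamba2_state_bytes(d_model=4096, n_layers=64, d_state=128,
                       expand=2, d_conv=4, ngroups=8, bytes_per_elem=4):
    d_inner = expand * d_model
    # SSM state: one (d_inner, d_state) matrix per layer
    ssm = n_layers * d_inner * d_state
    # Conv state: the last (d_conv - 1) columns of the conv input
    # (x, B and C are convolved together in Mamba-2)
    conv = n_layers * (d_conv - 1) * (d_inner + 2 * ngroups * d_state)
    return (ssm + conv) * bytes_per_elem

print(f"{mamba2_state_bytes() / 2**20:.1f} MiB per sequence")  # a few hundred MiB in f32
```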
This issue was closed because it has been inactive for 14 days since being marked as stale.
Feature Description
New 7B coding model just released by Mistral.
Motivation
Seems to perform very well, especially for a 7B model.
Possible Implementation
An extension to #7727?