
Feature Request: Support Codestral Mamba #8519

Open
VelocityRa opened this issue Jul 16, 2024 · 12 comments · May be fixed by #9126
Labels
enhancement (New feature or request)

Comments


VelocityRa commented Jul 16, 2024

Feature Description

New 7B coding model just released by Mistral.

Motivation

Seems to perform very well, especially for a 7B model:

[image: benchmark results]

Possible Implementation

An extension to #7727?

VelocityRa added the enhancement label Jul 16, 2024
@HanClinto (Collaborator)

I love the shout-out in the linked blog post!

> You can deploy Codestral Mamba using the mistral-inference SDK, which relies on the reference implementations from Mamba’s GitHub repository. The model can also be deployed through TensorRT-LLM. For local inference, keep an eye out for support in llama.cpp. You may download the raw weights from HuggingFace.

That's a really nice nod -- love to see it!

@theo77186

#7727 should cover this model, but with untied embeddings, unlike the other Mamba-2 models.
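
(For anyone unfamiliar with the term: with tied embeddings the output projection reuses the token-embedding matrix, while untied means a separate output tensor has to be converted and loaded. A rough numpy illustration, with made-up shapes:)

```python
import numpy as np

vocab, d_model = 32768, 4096                # made-up shapes, for illustration
tok_embd = np.random.randn(vocab, d_model)  # input embedding matrix
hidden = np.random.randn(1, d_model)        # final hidden state of a token

# Tied embeddings: the output head reuses tok_embd.
logits_tied = hidden @ tok_embd.T

# Untied embeddings (as in Codestral Mamba): a distinct output matrix
# exists and must be handled separately during conversion.
lm_head = np.random.randn(vocab, d_model)
logits_untied = hidden @ lm_head.T
```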

@timlacroix

FYI, there is an "ngroups" param that changes how the layer norm is done: https://github.com/state-spaces/mamba/blob/c0a00bd1808881831ddf43206c69362d4df90cf7/mamba_ssm/modules/mamba2.py#L47

We use ngroups=8. If you forget it, or try with ngroups=1, you'll have a bad time.

Good luck!
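
(Judging from RMSNormGated in the linked reference code, what ngroups changes is that the RMS statistics are computed per group of d_inner // ngroups channels instead of over the whole hidden dimension. A numpy sketch, with the gate-then-norm order assumed from the reference's default norm_before_gate=False path:)

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x)

def gated_rmsnorm(x, z, weight, ngroups=8, eps=1e-5):
    """x, z: (..., d_inner); weight: (d_inner,). Illustrative only."""
    x = x * silu(z)  # gate first (norm_before_gate=False)
    *batch, d_inner = x.shape
    # RMS is computed per group, not over all d_inner channels:
    xg = x.reshape(*batch, ngroups, d_inner // ngroups)
    rms = np.sqrt(np.mean(xg**2, axis=-1, keepdims=True) + eps)
    return (xg / rms).reshape(*batch, d_inner) * weight
```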

@ggerganov (Owner)

After we merge #8526 we should try to add full support for this model. cc @compilade


0wwafa commented Jul 17, 2024

I'd love this.


txhno commented Jul 18, 2024

thanks!

@fredconex

Hey guys, any progress or an ETA for it?

@rmusser01

For anyone else: it seems this is waiting on #8526, which is waiting on #8980, which is waiting on review(?).

compilade (Collaborator) commented Aug 17, 2024

Some progress report: I have a local branch (not yet public) on top of #8526 in which I've started implementing the compute graph for Mamba-2. The conv step is very similar to Mamba-1's, and I've started implementing the SSM step, which I'll continue over the next few days. It's not in a usable state yet.

I'm starting by implementing the fully recurrent mode of Mamba-2, which is very similar to Mamba-1 and is described in Section 3.4.1 of the Mamba-2 paper.

But I'm still evaluating how the block decomposition would fit within how src/llama.cpp manages batches, and whether the chunk size should be dynamic. It seems like to fully benefit from Section 6, the chunks should be smaller than the batch size, but not too small, because at some point directly doing the recurrence costs the same. Since the ggml compute graph nodes should keep the same structure between batches, and since the block decomposition will likely have too much overhead for small batches, it's easier to simply go with the linear recurrence (something like ggml_ssm_scan) at first.
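
(For the curious, the fully recurrent mode boils down to a per-timestep diagonal state update. A simplified, single-head numpy sketch of what a ggml_ssm_scan-style kernel computes; illustrative shapes only, and the D skip connection is omitted:)

```python
import numpy as np

def ssm_scan(x, dt, A, B, C):
    """Sequential Mamba-2-style scan for a single head.
    x: (T, d_head), dt: (T,), A: negative scalar, B, C: (T, d_state)."""
    T, d_head = x.shape
    h = np.zeros((d_head, B.shape[1]))  # recurrent state, fixed size
    y = np.empty_like(x)
    for t in range(T):
        dA = np.exp(dt[t] * A)                     # scalar decay per head
        h = dA * h + np.outer(dt[t] * x[t], B[t])  # state update
        y[t] = h @ C[t]                            # readout
    return y
```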

For the ETA, I'll try to get it working before the end of August, but no promises.

(and BTW @rmusser01, #8980 is waiting on #8526, not the other way around, at least I think?)

compilade (Collaborator) commented Aug 19, 2024

Okay, the fully recurrent mode works for Mamba-2! (for the curious, see this branch: https://github.com/compilade/llama.cpp/tree/compilade/mamba2) I'll open a PR soon, in the next few days; I still need to clean up some things.

Note that Mamba-Codestral-7B-v0.1 cannot be converted as-is; either use https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1/discussions/9, or rename consolidated.safetensors to model.safetensors, tokenizer.model.v3 to tokenizer.model, and params.json to config.json. Then, in config.json, add the line "architectures": ["Mamba2ForCausalLM"], if it's missing.
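
(If it helps, here is a small Python helper doing those exact renames and the config.json edit; run it from inside the downloaded model directory:)

```python
import json, os

renames = {
    "consolidated.safetensors": "model.safetensors",
    "tokenizer.model.v3": "tokenizer.model",
    "params.json": "config.json",
}
for src, dst in renames.items():
    if os.path.exists(src):
        os.rename(src, dst)

# Add the architectures field if it's missing.
with open("config.json") as f:
    config = json.load(f)
config.setdefault("architectures", ["Mamba2ForCausalLM"])
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```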

The state in Mamba-2 is bigger than I thought: Mamba-Codestral-7B-v0.1 takes 263.5 MiB (in F32) per sequence (e.g. with -np 1), compared to 38 MiB for Falcon-Mamba-7B (which is based on Mamba-1). But it stays constant regardless of the context size.
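
(Where the 263.5 MiB comes from, as a back-of-the-envelope check; the hyperparameters are read off the published Codestral Mamba config and the conv-state layout is assumed from the Mamba-2 reference implementation, so treat this as a sanity check rather than gospel:)

```python
# Per-sequence recurrent state of Mamba-Codestral-7B-v0.1 in F32 (assumed
# config: n_layer=64, d_model=4096, expand=2, d_state=128, d_conv=4, ngroups=8).
n_layer, d_model, expand = 64, 4096, 2
d_state, d_conv, n_groups = 128, 4, 8
d_inner = expand * d_model                                      # 8192

ssm_state = d_inner * d_state                                   # floats per layer
conv_state = (d_conv - 1) * (d_inner + 2 * n_groups * d_state)  # x and B/C streams
total_mib = n_layer * (ssm_state + conv_state) * 4 / 2**20      # F32 = 4 bytes
print(total_mib)  # -> 263.5
```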

A big downside right now with recurrent models in llama.cpp is the lack of state rollback (implemented through state checkpoints in #7531, but that implementation needs to be re-adapted to #8526), so the prompt will be reprocessed a lot when using llama-server. I think using llama-cli in conversation mode does not have this problem, however (or maybe that's only true of the bare interactive mode with --in-prefix and --in-suffix; not sure).

The implementation is CPU-only, but it uses SIMD for the SSM scan, so even though the state is bigger than for Mamba-1 models, in my tests the speed of Mamba-2-130M is similar to or better than Mamba-130M (though still not that fast compared to transformer-based models starting from an empty context).

The speed of Mamba-2 models seems comparable to that of Transformer-based models once the latter have 2k to 4k tokens in their context.

Just making sure expectations are not too far from reality.

compilade linked a pull request Aug 21, 2024 that will close this issue
github-actions bot added the stale label Sep 19, 2024
github-actions bot (Contributor) commented Oct 4, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Oct 4, 2024
Galunid reopened this Oct 4, 2024
compilade removed the stale label Oct 4, 2024
github-actions bot added the stale label Nov 4, 2024
github-actions bot (Contributor) commented

This issue was closed because it has been inactive for 14 days since being marked as stale.

compilade removed the stale label Nov 18, 2024
compilade reopened this Nov 18, 2024