[Bug]: Bunches of Issues in Mamba and Mamba2 #90
Comments
Probably also related to state-spaces/mamba#641.
Maybe a bit late, but I'll try to clarify some things:
I also added a bunch of fixes in transformers recently which should probably be ported over here as well. No idea when I'll have time, but others can gladly take over!
Before I forget, the reference for those fixes: huggingface/transformers#35154
@vasqu Thanks very much for answering.
No idea tbh; it relates to context parallelism in a way as well, so it has benefits for multiple purposes, both inference and training. I'm not too familiar with the details of context parallelism, but only the mamba and conv ops should pose problems, i.e. they need special treatment. I think we'd need to pass initial states to the mamba op and carry a conv "cache" for the conv. Should be feasible imo; a rough sketch of the idea follows.
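To make the initial-state point concrete, here is a minimal pure-PyTorch sketch of a diagonal linear recurrence standing in for the actual Mamba scan kernel (the `scan` helper, sizes, and chunking below are illustrative assumptions, not code from this repo): chunk-wise evaluation matches the full scan exactly precisely because each chunk starts from the previous chunk's final state.

```python
# Sketch only: a diagonal linear recurrence h_t = a_t * h_{t-1} + b_t as a
# stand-in for the Mamba scan. Shows that chunked evaluation is exact when
# the final state of one chunk is passed as the initial state of the next.
import torch

def scan(a, b, h0):
    # a, b: (T, D); h0: (D,). Returns all states (T, D) and the final state.
    states, h = [], h0
    for t in range(a.shape[0]):
        h = a[t] * h + b[t]
        states.append(h)
    return torch.stack(states), h

T, D, chunk = 64, 8, 16
a, b = torch.rand(T, D), torch.randn(T, D)
h0 = torch.zeros(D)

full, _ = scan(a, b, h0)                 # reference: one pass over everything

outs, h = [], h0                         # chunked pass, carrying the state
for s in range(0, T, chunk):
    out, h = scan(a[s:s + chunk], b[s:s + chunk], h)
    outs.append(out)

print(torch.allclose(full, torch.cat(outs), atol=1e-6))  # True
```

The conv op would need the analogous treatment: the last `conv_kernel - 1` inputs of each chunk have to be prepended to the next chunk, which is the conv "cache" mentioned above.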
Describe the bug
In both Mamba and Mamba2, there is an obvious difference between inference over the whole sequence and chunk-wise inference.
In Mamba2, only hidden_size=2048 works; other values trigger an error.
What is cache_position used for? It is quite confusing.
Steps to reproduce the bug
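The original reproduction script is not preserved in this thread. As a stand-in, here is a minimal sketch of the comparison described above, assuming a transformers-style Mamba2 API (`Mamba2Config`/`Mamba2Model` with `cache_params` and `cache_position`); the layer count, hidden size, head layout, chunking, and seed are illustrative assumptions:

```python
# Hypothetical reconstruction of the repro: compare one forward pass over the
# whole sequence against chunk-wise forward passes that carry the state cache.
import torch
from transformers import Mamba2Config, Mamba2Model

torch.manual_seed(0)
# expand * hidden_size must equal num_heads * head_dim: 2 * 2048 == 64 * 64
config = Mamba2Config(num_hidden_layers=2, hidden_size=2048,
                      num_heads=64, head_dim=64)
model = Mamba2Model(config).cuda()

n_chunks, chunk_len = 8, 32
input_ids = torch.randint(0, config.vocab_size,
                          (1, n_chunks * chunk_len), device="cuda")

# Reference: a single forward pass over the full sequence.
full = model(input_ids).last_hidden_state

# Chunk-wise inference; multi-token steps against a prefilled cache are
# exactly the case at issue here.
cache = None
for i in range(n_chunks):
    sl = slice(i * chunk_len, (i + 1) * chunk_len)
    out = model(input_ids[:, sl], cache_params=cache, use_cache=True,
                cache_position=torch.arange(sl.start, sl.stop, device="cuda"))
    cache = out.cache_params
    # Per-chunk deviation from the full-sequence reference.
    print(i, (out.last_hidden_state - full[:, sl]).abs().sum())
```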
I got the following output:
```
0 tensor(2.8154e-07, device='cuda:0', grad_fn=)
1 tensor(3285.6797, device='cuda:0', grad_fn=)
2 tensor(687.7205, device='cuda:0', grad_fn=)
3 tensor(801.7145, device='cuda:0', grad_fn=)
4 tensor(772.4307, device='cuda:0', grad_fn=)
5 tensor(688.4492, device='cuda:0', grad_fn=)
6 tensor(690.9346, device='cuda:0', grad_fn=)
7 tensor(897.0207, device='cuda:0', grad_fn=)
```
Expected behavior
All per-chunk differences should be close to zero (as in chunk 0).
Environment info