Model seeing a black image even when there's no image attached. (Molmo 7B D) #28
Comments
The model's training is image focused, unfortunately, and without an image it will just error, so I provide a black pixel. It's a very specialized fine-tune and doesn't handle image-free chat well. 4 T/s does seem slow. I see about 8GB of VRAM usage when I run it that way, and get about 25 T/s. Could it be loading on the CPU? I don't really know Windows/WSL, so I may not be able to help you much, but nvidia-smi should show you GPU VRAM usage.
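For context, a minimal sketch of the kind of placeholder being described: a 1×1 black PNG encoded as a base64 data URL, the shape an OpenAI-style image_url content part expects. This is only an illustration of the workaround, not the project's exact code.

```python
# Sketch: build a 1x1 black PNG and wrap it as a data URL placeholder
# (assumed shape of the workaround described above, not the project's actual code).
import base64
from io import BytesIO

from PIL import Image

buf = BytesIO()
Image.new("RGB", (1, 1), (0, 0, 0)).save(buf, format="PNG")  # single black pixel
black_pixel_url = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

# This data URL can then be attached as an image_url content part when the
# user's request contains no image, so the image-only model still receives one.
```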
Just an idea: with open-webui you can switch the model partway through the conversation, so you could start with no image on one model, and later switch to the image model when you add the image.
Hmm, got it. But Ubuntu-WSL is using ALL of the GPU's 12GB plus another 4GB of my RAM; I really think it is something about context length. Is there any way I can change the max context length manually?
Oh thank you! That's a clever workaround :)
I don't have an option for this yet, but it's a good idea; I really should. Are you sure you're loading with --load-in-4bit? Without 4-bit I would expect about 17GB of usage.
#CLI_COMMAND="python vision.py -m allenai/Molmo-7B-D-0924 -A flash_attention_2 --load-in-4bit --use-double-quant" # test pass✅, time: 45.2s, mem: 7.7GB, 13/13 tests passed, (318/14.5s) 21.9 T/s
#CLI_COMMAND="python vision.py -m allenai/Molmo-7B-D-0924 -A flash_attention_2 --load-in-4bit" # test pass✅, time: 38.6s, mem: 8.1GB, 13/13 tests passed, (310/12.3s) 25.2 T/s
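As a rough illustration (not vision.py's actual loader code), this is approximately what those flags map to when loading the model with transformers and bitsandbytes; the exact arguments the project passes may differ.

```python
# Sketch, assuming --load-in-4bit / --use-double-quant roughly correspond
# to a bitsandbytes 4-bit quantization config.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,               # --load-in-4bit
    bnb_4bit_use_double_quant=True,  # --use-double-quant
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "allenai/Molmo-7B-D-0924",
    quantization_config=quant_config,
    attn_implementation="flash_attention_2",  # -A flash_attention_2
    torch_dtype=torch.bfloat16,
    device_map="auto",       # keep weights on the GPU when they fit
    trust_remote_code=True,  # Molmo ships custom modeling code
)
```

Without the quantization_config, the 7B weights alone in bf16 take roughly 14-15GB, which lines up with the ~17GB estimate above; with 4-bit quantization the observed usage drops to about 8GB.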
Oh, molmo-7B-D-bnb-4bit: I couldn't get that one to load properly, so I left it out of the supported list. Maybe try it again without --load-in-4bit? But as I said, I couldn't get it to load properly.
It worked! I just had to use the default model rather than the pre-quantized one. I get 15 T/s now, which is a lot more usable. Thank you!!
For a while now I've been trying to get this project to work locally on Windows, and after a lot of effort I decided to just run it under Ubuntu-WSL. It works fine, but with some inconveniences, as shown in the image. I just want to chat with text only at first and then add images later (see the sketch after the environment details below).
Also, inference is really slow (4 T/s), and I don't think it's because of my hardware (maybe it's something about context length, but I'm not sure).
I would appreciate some help on those issues.
Environment:
OS: Ubuntu-WSL (I already tried running on Docker/Windows and faced problems)
Model: Molmo 7B D
Quantization: BNB 4bit
Hardware:
GPU: 3060 12GB
RAM: 16GB
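For reference, a minimal sketch of the workflow being described (a text-only turn first, then an image added later) against an OpenAI-compatible endpoint like the one this project exposes. The base URL, port, and API key here are assumptions; adjust them to your local setup.

```python
# Sketch: text-first, image-later chat against an OpenAI-compatible server.
# The base_url/port and api_key below are assumptions; adjust to your setup.
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:5006/v1", api_key="skip")

# 1) Text-only turn (the server may substitute a placeholder image internally).
first = client.chat.completions.create(
    model="allenai/Molmo-7B-D-0924",
    messages=[{"role": "user", "content": "Hi, describe what you can do."}],
)
print(first.choices[0].message.content)

# 2) A later turn with an actual image attached as a base64 data URL.
with open("photo.jpg", "rb") as f:
    image_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

second = client.chat.completions.create(
    model="allenai/Molmo-7B-D-0924",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this photo?"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
)
print(second.choices[0].message.content)
```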