
Model seeing a black image even when there's no image attached. (Molmo 7B D) #28

Open
fixu124 opened this issue Dec 7, 2024 · 8 comments


@fixu124

fixu124 commented Dec 7, 2024

[screenshot attached]

For a while now I've been trying to get this project to work locally on Windows, and after a lot of work I decided to just run it under Ubuntu-WSL. It works, but with some inconveniences, as shown in the screenshot: I want to chat with text only at first and then add images later.
Also, inference is REALLY slow (4 T/s), and I don't think it's because of my hardware (maybe it's something about context length, but I'm not sure).

I would appreciate some help on those issues.

Environment:
OS: Ubuntu-WSL (I already tried running on Docker/Windows and faced problems)
Model: Molmo 7B D
Quantization: BNB 4bit

Hardware:
GPU: 3060 12GB
RAM: 16GB

@matatonic
Owner

Unfortunately the model's training is image-focused, and without an image it will just error out, so I provide a black pixel. It's a very specialized fine-tune and doesn't handle text-only chat well.
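
For illustration, this is roughly what that kind of fallback looks like (a minimal sketch using PIL and an OpenAI-style base64 data URL; not necessarily the exact code in vision.py):

import base64
import io

from PIL import Image

def black_pixel_data_url() -> str:
    # Build a 1x1 black RGB image entirely in memory.
    img = Image.new("RGB", (1, 1), color=(0, 0, 0))
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    # Encode it as a data URL so it can stand in for a missing image_url.
    b64 = base64.b64encode(buf.getvalue()).decode("ascii")
    return f"data:image/png;base64,{b64}"

# Hypothetical usage: fall back to the black pixel when the request has no image.
# image_url = user_image_url or black_pixel_data_url()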

4 T/s does seem slow. I see about 8GB of VRAM usage when I run it that way and get about 25 T/s. Could it be loading on the CPU? I don't really know Windows/WSL, so I may not be able to help you much, but nvidia-smi should show you the GPU's VRAM usage.
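
If you'd rather check from inside Python than with nvidia-smi, here is a rough sketch (assuming a loaded transformers model object named model; the names are placeholders):

import torch

def report_device_usage(model):
    # Where the weights actually live; "cpu" showing up here would explain very low T/s.
    devices = {p.device.type for p in model.parameters()}
    print("parameter devices:", devices)
    if torch.cuda.is_available():
        # Allocated vs. reserved VRAM on GPU 0, in GiB.
        allocated = torch.cuda.memory_allocated(0) / 1024**3
        reserved = torch.cuda.memory_reserved(0) / 1024**3
        print(f"VRAM allocated: {allocated:.1f} GiB, reserved: {reserved:.1f} GiB")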

@matatonic
Owner

Just an idea: with open-webui you can switch the model partway through the conversation, so you could start with no image on one model and later switch to the image model when you add the image.
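
Under the hood that just means sending a different model name per request to the OpenAI-compatible endpoint. Roughly, as a sketch with the openai Python client (the base URL, port, and text-model name are placeholders, not something this project ships):

from openai import OpenAI

# Placeholder endpoint; point this at whatever OpenAI-compatible server(s) you run.
client = OpenAI(base_url="http://localhost:5006/v1", api_key="sk-none")

# Text-only turns can go to a text model...
client.chat.completions.create(
    model="some-text-model",  # placeholder name
    messages=[{"role": "user", "content": "Hello, just text for now."}],
)

# ...and a later turn can switch to the vision model once an image is attached.
client.chat.completions.create(
    model="allenai/Molmo-7B-D-0924",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        ],
    }],
)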

@fixu124
Author

fixu124 commented Dec 7, 2024

Hmm, got it. But Ubuntu-WSL is using ALL of the GPU's 12GB plus another 4GB of my RAM, so I REALLY think it's something about context length. Is there any way I can change the max context length manually?

@fixu124
Author

fixu124 commented Dec 7, 2024

Oh thank you! That's a clever workaround :)

@matatonic
Owner

I don't have an option for setting the max context length yet, but it's a good idea, I really should add one.

Are you sure you're loading with --load-in-4bit? Without 4-bit I would expect about 17GB of usage.

#CLI_COMMAND="python vision.py -m allenai/Molmo-7B-D-0924 -A flash_attention_2 --load-in-4bit --use-double-quant"  # test pass✅, time: 45.2s, mem: 7.7GB, 13/13 tests passed, (318/14.5s) 21.9 T/s
#CLI_COMMAND="python vision.py -m allenai/Molmo-7B-D-0924 -A flash_attention_2 --load-in-4bit"  # test pass✅, time: 38.6s, mem: 8.1GB, 13/13 tests passed, (310/12.3s) 25.2 T/s

@fixu124
Author

fixu124 commented Dec 7, 2024

I'm using this exact command: python vision.py -m "/home/homemdesgraca/openedai-vision/models/molmo-7B-D-bnb-4bit" -A flash_attention_2 --load-in-4bit, and it is still using my full 12GB of VRAM plus 4GB of RAM.
[screenshot attached]

@matatonic
Owner

Oh, molmo-7B-D-bnb-4bit: I couldn't get that one to load properly, so I left it out of the supported list. Maybe try again without --load-in-4bit? But as I said, I couldn't get it to load properly.

@fixu124
Author

fixu124 commented Dec 7, 2024

It worked! I just had to use the default model rather than the pre-quantized one. I get 15 T/s now, which is a lot more usable. Thank you!!
