Keeping track of the performance and compatibility of models #147
This has some output for a variety of models, for the following versions:
There are quite a few glitches for various models, including deprecation warnings, the need to trust code, and assertion errors; some models are also too slow to be practical.
The script to reproduce this is:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
import subprocess
import time
import psutil

output = subprocess.check_output(
    ["/opt/homebrew/Caskroom/miniconda/base/envs/mlx/bin/huggingface-cli", "scan-cache"]
)
lines = output.decode("utf-8").split("\n")[2:-4]

for line in lines:
    print(80 * "=")
    model_path = line.split()[0]
    print("\033[1mRunning", model_path, "\033[0m")

    process = psutil.Process()
    mem_before = process.memory_info().rss

    try:
        # Load the model
        model, tokenizer = load(model_path)
        config = load_config(model_path)
    except Exception as e:
        print(f"Failed to load model at {model_path}: {e}")
        continue

    # Prepare input
    image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
    prompt = "Describe this image."

    # Apply chat template
    formatted_prompt = apply_chat_template(
        tokenizer, config, prompt, num_images=len(image)
    )

    # Generate output
    try:
        start_time = time.time()
        output = generate(model, tokenizer, image, formatted_prompt, verbose=True)
        end_time = time.time()
        print(output)
    except Exception as e:
        print(f"Failed to generate output for model at {model_path}: {e}")
        continue

    mem_after = process.memory_info().rss
    print(f"Output generated in {end_time - start_time:.2f}s")
    print(f"Memory used: {(mem_after - mem_before) / (1024 * 1024 * 1024):.2f} GB")
    print(80 * "-", end="\n\n")
```
Hey @jrp2014 Thank you very much! What is your proposed solution? To clarify, the need to trust the code and the deprecation warnings come from HF transformers. Regarding the models that are slow, I think reducing the image size should address this.
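For example, something along these lines (just a sketch; the 512 px cap and the temporary filename are arbitrary choices, and the resized file is then passed to generate exactly as in your script):

```python
from io import BytesIO

import requests
from PIL import Image

# Download the test image and downscale it before generation; a smaller image
# means fewer image tokens, so prompt processing is much faster.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
img = Image.open(BytesIO(requests.get(url).content))
img.thumbnail((512, 512))  # keeps the aspect ratio, caps the longest side at 512 px
img.save("resized.jpg")

# Then pass the local, resized file to generate() instead of the URL
image = ["resized.jpg"]
```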
I think that the main thing is to document the capabilities of the different models. Some are very fast but don't produce very detailed results. Others are slow, but worth waiting for. Others are a bit too breezy for my taste. And some don't produce keywords / captions, only a description; I don't know whether that could be changed with a different system prompt, for example.

The trust thing could be passed through and exposed as a parameter to load / load_config. Several model types seem to be unsupported, and I have no idea how much work it would be to support them / their families. It's also not clear to me why some models are limited by image size while others are less so.

The following seems to be in the vlm code.
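Something like this is what I have in mind (purely a sketch; I don't know whether load / load_config actually accept a trust_remote_code keyword in the current release, so treat that parameter as hypothetical):

```python
from mlx_vlm import load
from mlx_vlm.utils import load_config

model_path = "mlx-community/deepseek-vl2-8bit"

# Hypothetical: forward trust_remote_code down to the underlying HF transformers
# loaders so models with custom code don't stop at an interactive prompt.
model, tokenizer = load(model_path, trust_remote_code=True)
config = load_config(model_path, trust_remote_code=True)
```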
Please share the command you used and the version of MLX-VLM.
It's just the script above, but without the try/except around the generate call.
I ran your script and it worked. Unfortunately, I can't replicate this issue.

Code:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
import subprocess
import time
import psutil

model_path = "mlx-community/deepseek-vl2-8bit"

print("\033[1mRunning", model_path, "\033[0m")

process = psutil.Process()
mem_before = process.memory_info().rss

try:
    # Load the model
    model, tokenizer = load(model_path)
    config = load_config(model_path)
except Exception as e:
    print(f"Failed to load model at {model_path}: {e}")

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(
    tokenizer, config, prompt, num_images=len(image)
)

# Generate output
try:
    start_time = time.time()
    output = generate(model, tokenizer, image, formatted_prompt, verbose=True)
    end_time = time.time()
    print(output)
except Exception as e:
    print(f"Failed to generate output for model at {model_path}: {e}")

mem_after = process.memory_info().rss
print(f"Output generated in {end_time - start_time:.2f}s")
print(f"Memory used: {(mem_after - mem_before) / (1024 * 1024 * 1024):.2f} GB")
print(80 * "-", end="\n\n")
```

Output:

```
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg']
Prompt: <|User|>: <image>
Describe this image.
<|Assistant|>:
Two tabby cats lying on what appears to be a red couch or cushioned surface covered by a pink blanket that has fringed edges. The cat closest to the top of the frame is lying on its side facing leftward; it appears relaxed but alert as if observing something out of view. Its body language suggests relaxation yet attentiveness. Next to this first cat lies another tabby cat facing rightward towards the camera's perspective; only part of his face can be seen peeking over the
==========
Prompt: 2.412 tokens-per-sec
Generation: 34.423 tokens-per-sec
Output generated in 8.68s
Memory used: 5.40 GB
--------------------------------------------------------------------------------
```
Are you running
Very curious. That works for me, too.
Version 18 seems to work more smoothly and faster. There are still a couple of models that produce strange results, but it is probably a model issue, rather than a vlm issue.
```python
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "<huggingface_model_id>"
# Load the model, then save a local copy to convert
model = AutoModelForCausalLM.from_pretrained(model_id)
model.save_pretrained("<local_dir>")
```

```
python -m mlx_vlm.convert --hf-path <local_dir> --mlx-path <mlx_dir>
```
Thanks! I will take a look at pixtral and deepseek. That shouldn't happen.
And llava-v1.6-34b-8bit seems to need some attention in future. Dolphin seems to need some extra parameters. Is Florence just not converted?
Sure. Florence, no; but Florence-2, yes, it is converted 👌🏽
@jrp2014 just tested it and it works. For pixtral I would recommend using the one on the mlx-community hub: https://huggingface.co/mlx-community?search_models=pixtral
However, I found a bug with the language-only responses that will be fixed in the next release.
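For example (a minimal sketch along the lines of the script above; check the search link for the exact repo name / quantisation you want, the one below is just the checkpoint mentioned in the runs here):

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the pre-converted community checkpoint instead of converting locally
model_path = "mlx-community/pixtral-12b-8bit"
model, tokenizer = load(model_path)
config = load_config(model_path)

image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
formatted_prompt = apply_chat_template(
    tokenizer, config, "Describe this image.", num_images=len(image)
)
print(generate(model, tokenizer, image, formatted_prompt, verbose=True))
```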
Could you elaborate? I just tested and it is working fine.
The Pixtral and DeepSeek fix is here: #165. It will be available as soon as the tests pass.
I'm just going by the transcript above.
What do you mean?
Sorry, I must have ... errr ... hallucinated some of the reported issues / warnings. PS: are the READMEs / examples up to date with the latest changes?
A new run with version 19 and the latest mlx (which seems to break a couple of models). Main thing is how fast this package has become!
I found the warning that the Llava model issues:
Thank you very much! Your evals do help me a lot. Please run the Florence-2 from the MLX community repo. MLX-VLM only supports safetensors.
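If a checkpoint you want only ships pickle/.bin weights, one option (a sketch, reusing the same placeholders as the convert snippet above) is to re-save it as safetensors with transformers first, then run mlx_vlm.convert on the local copy:

```python
from transformers import AutoModelForCausalLM

# Re-save a .bin checkpoint as safetensors so mlx_vlm.convert can read it
model = AutoModelForCausalLM.from_pretrained("<huggingface_model_id>", trust_remote_code=True)
model.save_pretrained("<local_dir>", safe_serialization=True)
```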
I will fix all those. Regarding the warning, I wouldn't worry; it's a transformers warning I will handle soon.
Most of the models I have picked can provide some sort of description of the given image, but few can go further and provide keywords, and those keywords are generally of limited quality.
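For illustration, the kind of keywording prompt I mean (just an example of the wording; tokenizer and config come from load / load_config exactly as in the script above):

```python
from mlx_vlm.prompt_utils import apply_chat_template

# A more explicit captioning/keywording instruction than the plain
# "Describe this image." used in the script above.
prompt = (
    "Describe this image in one detailed paragraph, then list 10 concise, "
    "comma-separated keywords suitable for a photo catalogue."
)
formatted_prompt = apply_chat_template(tokenizer, config, prompt, num_images=1)
```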
Today's run, on
This is a hard image, as it is so non-descript, but some models make a reasonable stab, while others repeat nonsense. The mlx machinery is clearly working, as GPU usage is often pegged at 100%. Llama-3.2-11B-Vision-Instruct-8bit is pretty good at describing the image, but generates some laughable keywords. Molmo-7B-D-0924-8bit, idefics2-8b-chatty-8bit and pixtral-12b-8bit aren't bad.
This is just a snapshot of my impressions of various models from the perspective of keywording / captioning.
In summary, at this point there are a couple of good and fast models for this purpose; more just give a good, fast description of the image; others give a very fast but very succinct account of the image (without keywords). Several models are not yet supported, or have config files that mlx-vlm can't use.
A few models are just too slow, or need too much memory (even on a 128 GB Mac), to function.
I'll add / subtract from these as I experiment further.