Keeping track of the performance and compatibility of models #147

Open
jrp2014 opened this issue Dec 14, 2024 · 24 comments


jrp2014 commented Dec 14, 2024

This is just a snapshot of my impressions of various models from the perspective of keywording / captioning.

In summary, at this point a couple of models are both good and fast for this purpose. More of them give a good, fast description of the image but no keywords, and others give a very fast but very succinct account (again without keywords). Several models are not yet supported, or have config files that mlx-vlm can't use.

A few models are just too slow or need too much memory (on a 128 GB Mac) to function.

I'll add / subtract from these as I experiment further.
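To put the memory failures in perspective, the byte counts in the allocation errors quoted in the script comments below can be converted to GiB. This is only a rough sketch: `allocation_gib` is a hypothetical helper name, and the regular expression assumes the error wording stays exactly as it appears in those comments.

```python
import re

GIB = 1024 ** 3

# Assumption: the error text keeps the wording seen in the comments below.
ALLOC_RE = re.compile(
    r"Attempting to allocate (\d+) bytes .*? maximum allowed buffer size of (\d+) bytes"
)


def allocation_gib(message):
    """Return (requested_gib, limit_gib), or None if the message does not match."""
    match = ALLOC_RE.search(message)
    if match is None:
        return None
    requested, limit = (int(group) / GIB for group in match.groups())
    return requested, limit


msg = (
    "libc++abi: terminating due to uncaught exception of type std::runtime_error: "
    "Attempting to allocate 135383101952 bytes which is greater than the maximum "
    "allowed buffer size of 77309411328 bytes."
)
requested, limit = allocation_gib(msg)
print(f"requested {requested:.1f} GiB, limit {limit:.1f} GiB")
# prints: requested 126.1 GiB, limit 72.0 GiB
```

For the Qwen2-VL-72B 8-bit case, the single buffer being requested is already close to the machine's total memory, which is consistent with the model being unusable here.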

import mlx.core as mx # unused ...
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config, load_image_processor # unused?

from PIL import Image # Unused

import os
from pathlib import Path

# model_path="JosefAlbers/akemiH_MedQA_Reason"
# model_path = "HuggingFaceTB/SmolVLM-Instruct" # Fast, but too concise (eg, no keywords)
# model_path="Qwen/Qwen2-VL-7B-Instruct" # to be downloaded
# model_path="cognitivecomputations/dolphin-2.9.2-qwen2-72b" # To be downloaded
# model_path="distilbert/distilbert-base-uncased-finetuned-sst-2-english"
# model_path="google/siglip-so400m-patch14-384" # To be downloaded
# model_path="meta-llama/Llama-3.2-11B-Vision-Instruct" # Unusably slow, gives fairly detailed captions, but is not always accurate.  Uses over 90Gb.  No keywords.
# model_path="meta-llama/Llama-3.2-90B-Vision-Instruct"
# model_path="microsoft/Phi-3.5-mini-instruct"
# model_path="microsoft/Phi-3.5-vision-instruct" # provides a good description, but that is all
# model_path="mistral-community/pixtral-12b" # Unsupported model type: pixtral
# model_path="mlx-community/Florence-2-large-ft-bf16" # Produces gibberish very quickly.  Corrupt?
# model_path="mlx-community/Llama-3.2-11B-Vision-Instruct-8bit" # Much better than the native version, but still slows down
# model_path="mlx-community/Molmo-7B-D-0924-bf16" # ModuleNotFoundError: No module named 'einops'
# model_path="mlx-community/Phi-3.5-vision-instruct-bf16" # Pretty good description, but no keywords
# model_path="mlx-community/Qwen2-VL-72B-Instruct-8bit" # libc++abi: terminating due to uncaught exception of type std::runtime_error: Attempting to allocate 135383101952 bytes which is greater than the maximum allowed buffer size of 77309411328 bytes.
# model_path="mlx-community/SmolVLM-Instruct-bf16" # Very good, fast description, but that is all
# model_path="mlx-community/dolphin-vision-72b-4bit" # mlx-vlm crash
# model_path="mlx-community/idefics2-8b-chatty-8bit" # ValueError: Unsupported model type: idefics2_vision
# model_path="mlx-community/llava-1.5-7b-4bit" # Vague description
# model_path="mlx-community/llava-v1.6-34b-8bit" # Pretty good
# model_path="mlx-community/llava-v1.6-mistral-7b-8bit" # Very similar to the above
# model_path="mlx-community/paligemma2-3b-pt-896-4bit" # too inaccurate
# model_path="mlx-community/pixtral-12b-8bit" # A rival to llava 1.6
# model_path="mlx-community/whisper-tiny"
# model_path="mlx-community/paligemma2-10b-ft-docci-448-bf16" # Generates a good description, but no keywords
# https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3   # Very concise, no keywords
# model_path = "mlx-community/llava-1.5-7b-4bit"
# model_path = "mlx-community/llava-v1.6-mistral-7b-8bit"
# model_path = "mlx-community/pixtral-12b-8bit" # To the point
# model_path = "Qwen/Qwen2-VL-7B-Instruct"  # libc++abi: terminating due to uncaught exception of type std::runtime_error: Attempting to allocate 269535412224 bytes which is greater than the maximum allowed buffer size of 28991029248 bytes.
# model_path = "mlx-community/llava-v1.6-34b-8bit" # Slower but more precise
# model_path = "mlx-community/Phi-3.5-vision-instruct-bf16" # OK, but doesn't provide keywords
# model_path = "mistral-community/pixtral-12b"
# model_path = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # needs about 95Gb, precise, but is slow
# model_path ="mlx-community/Qwen2-VL-72B-Instruct-8bit" # libc++abi: terminating due to uncaught exception of type std::runtime_error: Attempting to allocate 135383101952 bytes which is greater than the maximum allowed buffer size of 77309411328 bytes.
# model_path ="mlx-community/dolphin-vision-72b-4bit"  # Needs image_processor = load_image_processor(model_path) 
# model_path = "meta-llama/Llama-3.2-11B-Vision-Instruct" # Very slow, gives more detailed captions, but is not always accurate.  Uses over 90Gb.  No keywords.
model_path = "OpenGVLab/InternVL2_5-38B" # ValueError: Model type internvl_chat not supported.

print("Model: ", model_path)

# Load the model
model, processor = load(model_path)
# processor = load_image_processor(model_path)
config = load_config(model_path)

prompt = "Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily"

picpath = "/Users/x/Pictures/Processed"
pics = sorted(Path(picpath).iterdir(), key=os.path.getmtime, reverse=True)
pic = str(pics[0])
print("Image: ", pic)

# Apply chat template
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=1)

# Generate output
output = generate(model, processor, pic, formatted_prompt, max_tokens=500, verbose=True)
print(output)
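
Rather than editing `model_path` by hand for each model, the same calls could be wrapped in a loop that records which models fail. The following is a minimal sketch, not the actual checking script: `describe_with_mlx_vlm`, `check_models` and `fake_run` are hypothetical names, and the mlx-vlm calls simply mirror the ones used in the script above.

```python
def describe_with_mlx_vlm(model_path, image, prompt, max_tokens=500):
    """Run one model using the same mlx-vlm calls as the script above."""
    # Imported lazily so the loop below can be tried without mlx-vlm installed.
    from mlx_vlm import load, generate
    from mlx_vlm.prompt_utils import apply_chat_template
    from mlx_vlm.utils import load_config

    model, processor = load(model_path)
    config = load_config(model_path)
    formatted = apply_chat_template(processor, config, prompt, num_images=1)
    return generate(model, processor, image, formatted, max_tokens=max_tokens, verbose=False)


def check_models(model_paths, run_one):
    """Call run_one(model_path) per model; collect outputs and errors separately."""
    results, failures = {}, {}
    for model_path in model_paths:
        try:
            results[model_path] = run_one(model_path)
        except Exception as exc:  # e.g. "Unsupported model type: pixtral"
            failures[model_path] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_run(model_path):
    """Stand-in for describe_with_mlx_vlm so the loop can be demonstrated offline."""
    if "pixtral" in model_path:
        raise ValueError("Unsupported model type: pixtral")
    return "Two cats on a pink blanket."


results, failures = check_models(
    ["example/ok-model", "mistral-community/pixtral-12b"], fake_run
)
print(results)
print(failures)
```

With real models, `run_one` would be something like `lambda path: describe_with_mlx_vlm(path, pic, prompt)`. Note that a hard abort such as the `libc++abi` allocation error terminates the whole interpreter, so `try`/`except` cannot catch it; isolating each model in its own subprocess would be needed for that.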

jrp2014 commented Dec 26, 2024

Here is some output from a variety of models, using these versions:

mlx                       0.21.1
mlx-lm                    0.20.4
mlx-vlm                   0.1.6

There are quite a few glitches across the various models: deprecation warnings, prompts to trust remote code, and assertion errors. Some models are also too slow to be practical.
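
Because the log is long, it may help to reduce it to one record per model. The sketch below assumes the line formats seen in this run ("Running ...", "Prompt: ... tokens-per-sec", "Generation: ... tokens-per-sec", "Output generated in ...s", "Memory used: ... GB", "Failed to ..."); `summarise` and the field names are hypothetical, and the sample text is a shortened excerpt of the log in this comment.

```python
import re

RUN_RE = re.compile(r"^Running (\S+)")
FAIL_RE = re.compile(r"Failed to (?:load model|generate output)")
FIELD_RES = {
    "prompt_tps": re.compile(r"^Prompt: ([\d.]+) tokens-per-sec"),
    "generation_tps": re.compile(r"^Generation: ([\d.]+) tokens-per-sec"),
    "seconds": re.compile(r"^Output generated in ([\d.]+)s"),
    "memory_gb": re.compile(r"^Memory used: (-?[\d.]+) GB"),
}


def summarise(log_text):
    """Return {model: {"failed": bool, plus any metrics found}} from the log text."""
    runs = {}
    current = None
    for line in log_text.splitlines():
        match = RUN_RE.match(line)
        if match:
            current = runs.setdefault(match.group(1), {"failed": False})
            continue
        if current is None:
            continue
        if FAIL_RE.search(line):
            current["failed"] = True
            continue
        for name, pattern in FIELD_RES.items():
            match = pattern.match(line)
            if match:
                current[name] = float(match.group(1))
    return runs


sample = """\
Running HuggingFaceTB/SmolVLM-Instruct
Prompt: <|im_start|>User:<image>Describe this image.<end_of_utterance>
Prompt: 11.882 tokens-per-sec
Generation: 109.093 tokens-per-sec
Output generated in 2.22s
Memory used: 4.40 GB
Running mistral-community/pixtral-12b
Failed to load model at mistral-community/pixtral-12b: Unsupported model type: pixtral
Running mlx-community/Llama-3.2-11B-Vision-Instruct-8bit
Generation: 0.623 tokens-per-sec
Memory used: -17.82 GB
"""

for model, info in summarise(sample).items():
    # A negative "Memory used" figure cannot be right, so flag it rather than trust it.
    suspect = info.get("memory_gb", 0.0) < 0
    print(model, info, "(memory figure suspect)" if suspect else "")
```

The negative memory readings in this run look like a measurement glitch rather than real usage, so the sketch flags them instead of reporting them as is.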

> python check_models.py
================================================================================
Running JosefAlbers/akemiH_MedQA_Reason 
Failed to load model at JosefAlbers/akemiH_MedQA_Reason: 404 Client Error. (Request ID: Root=1-676d9daa-33d271661f9fd5ba6a827885;6a9024f0-eee2-4a03-b82b-d3dccc08eb9a)

Repository Not Found for url: https://huggingface.co/api/models/JosefAlbers/akemiH_MedQA_Reason/revision/main.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
================================================================================
Running HuggingFaceTB/SmolVLM-Instruct 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 75459.74it/s]
Some kwargs in processor config are unused and will not have any effect: image_seq_len. 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 17879.80it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>User:<image>Describe this image.<end_of_utterance>
Assistant:
 Two cats are sleeping on a pink blanket. The cat on the left is curled up with its head facing down, while the cat on the right is stretched out with its eyes closed. Both cats have collars and are resting comfortably.
==========
Prompt: 11.882 tokens-per-sec
Generation: 109.093 tokens-per-sec
 Two cats are sleeping on a pink blanket. The cat on the left is curled up with its head facing down, while the cat on the right is stretched out with its eyes closed. Both cats have collars and are resting comfortably.
Output generated in 2.22s
Memory used: 4.40 GB
--------------------------------------------------------------------------------

================================================================================
Running cognitivecomputations/dolphin-2.9.2-qwen2-72b 
Fetching 40 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 12618.24it/s]
ERROR:root:Model type qwen2 not supported.
Failed to load model at cognitivecomputations/dolphin-2.9.2-qwen2-72b: Model type qwen2 not supported.
================================================================================
Running distilbert/distilbert-base-uncased-finetuned-sst-2-english 
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 13516.93it/s]
ERROR:root:Model type distilbert not supported.
Failed to load model at distilbert/distilbert-base-uncased-finetuned-sst-2-english: Model type distilbert not supported.
================================================================================
Running google/siglip-so400m-patch14-384 
Fetching 6 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 11765.23it/s]
ERROR:root:Model type siglip not supported.
Failed to load model at google/siglip-so400m-patch14-384: Model type siglip not supported.
================================================================================
Running meta-llama/Llama-3.2-11B-Vision-Instruct 
Fetching 15 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 7169.75it/s]
Fetching 15 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 14205.14it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Describe this image.<|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>


The image shows two cats lying on a pink blanket, with two remote controls placed nearby. The cat on the left is smaller and has a fluffy tail, while the larger cat on the right has a shorter tail. Both cats are lying on their backs, with their paws stretched out to the sides.

The background of the image is a pink blanket that covers the couch, providing a comfortable and cozy atmosphere. The overall scene suggests a relaxing and peaceful environment, with the cats enjoying some downtime on the
==========
Prompt: 4.353 tokens-per-sec
Generation: 0.525 tokens-per-sec
The image shows two cats lying on a pink blanket, with two remote controls placed nearby. The cat on the left is smaller and has a fluffy tail, while the larger cat on the right has a shorter tail. Both cats are lying on their backs, with their paws stretched out to the sides.

The background of the image is a pink blanket that covers the couch, providing a comfortable and cozy atmosphere. The overall scene suggests a relaxing and peaceful environment, with the cats enjoying some downtime on the
Output generated in 192.98s
Memory used: 15.91 GB
--------------------------------------------------------------------------------

================================================================================
Running microsoft/Phi-3.5-mini-instruct 
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 16928.27it/s]
ERROR:root:Model type phi3 not supported.
Failed to load model at microsoft/Phi-3.5-mini-instruct: Model type phi3 not supported.
================================================================================
Running microsoft/Phi-3.5-vision-instruct 
Fetching 14 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 15315.66it/s]
The repository for /Users/jrp/.cache/huggingface/hub/models--microsoft--Phi-3.5-vision-instruct/snapshots/4a0d683eba9f1d0cbfb6151705d1ee73c25a80ca contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//Users/x/.cache/huggingface/hub/models--microsoft--Phi-3.5-vision-instruct/snapshots/4a0d683eba9f1d0cbfb6151705d1ee73c25a80ca.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y
/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:524: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
Fetching 14 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 17109.63it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|user|>
<|image_1|>Describe this image.<|end|>
<|assistant|>

The image shows two cats lying on a pink sofa with their bodies stretched out and heads resting. There are two remote controls on the sofa, one to each side of the cats. The cat on the left has a collar with a bell, and both have striped fur patterns.<|end|>
==========
Prompt: 21.082 tokens-per-sec
Generation: 10.447 tokens-per-sec
The image shows two cats lying on a pink sofa with their bodies stretched out and heads resting. There are two remote controls on the sofa, one to each side of the cats. The cat on the left has a collar with a bell, and both have striped fur patterns.<|end|>
Output generated in 7.76s
Memory used: 7.84 GB
--------------------------------------------------------------------------------

================================================================================
Running mistral-community/pixtral-12b 
Fetching 15 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 14577.05it/s]
Failed to load model at mistral-community/pixtral-12b: Unsupported model type: pixtral
================================================================================
Running mlx-community/Florence-2-large-ft-bf16 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 16529.28it/s]
The repository for /Users/x/.cache/huggingface/hub/models--mlx-community--Florence-2-large-ft-bf16/snapshots/a4fe21022ed39adba398a31ee7ba8269d6c68c84 contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//Users/jrp/.cache/huggingface/hub/models--mlx-community--Florence-2-large-ft-bf16/snapshots/a4fe21022ed39adba398a31ee7ba8269d6c68c84.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 21354.11it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: Describe this image.
<s>Two cats are laying on a pink blanket with two remotes.
==========
Prompt: 17.150 tokens-per-sec
Generation: 153.557 tokens-per-sec
<s>Two cats are laying on a pink blanket with two remotes.
Output generated in 1.06s
Memory used: 0.77 GB
--------------------------------------------------------------------------------

================================================================================
Running mlx-community/Llama-3.2-11B-Vision-Instruct-8bit 
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 13774.40it/s]
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 20702.39it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Describe this image.<|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>


The image shows two cats lying on a pink blanket, with two remote controls placed nearby. The cat on the left is smaller and has a fluffy tail, while the larger cat on the right appears to be pregnant. Both cats are lying on their backs, with their paws stretched out in front of them.

The background of the image is a pink blanket, which provides a comfortable and cozy setting for the cats to relax. The presence of the remote controls suggests that the cats may be watching TV or
==========
Prompt: 4.324 tokens-per-sec
Generation: 0.623 tokens-per-sec
The image shows two cats lying on a pink blanket, with two remote controls placed nearby. The cat on the left is smaller and has a fluffy tail, while the larger cat on the right appears to be pregnant. Both cats are lying on their backs, with their paws stretched out in front of them.

The background of the image is a pink blanket, which provides a comfortable and cozy setting for the cats to relax. The presence of the remote controls suggests that the cats may be watching TV or
Output generated in 162.92s
Memory used: -17.82 GB
--------------------------------------------------------------------------------

================================================================================
Running mlx-community/Llama-3.3-70B-Instruct-8bit 
Fetching 20 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 15128.24it/s]
ERROR:root:Model type llama not supported.
Failed to load model at mlx-community/Llama-3.3-70B-Instruct-8bit: Model type llama not supported.
================================================================================
Running mlx-community/Molmo-7B-D-0924-8bit 
Fetching 16 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 14245.14it/s]
Failed to load model at mlx-community/Molmo-7B-D-0924-8bit: Expecting property name enclosed in double quotes: line 37 column 5 (char 998)
================================================================================
Running mlx-community/Molmo-7B-D-0924-bf16 
Fetching 18 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 11896.86it/s]
Failed to load model at mlx-community/Molmo-7B-D-0924-bf16: Expected shape (1024, 37888) but received shape (1024, 588) for parameter vision_tower.image_vit.patch_embedding.weight
================================================================================
Running mlx-community/Phi-3.5-vision-instruct-bf16 
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 11109.61it/s]
Failed to load model at mlx-community/Phi-3.5-vision-instruct-bf16: 'img_processor'
================================================================================
Running mlx-community/QVQ-72B-Preview-8bit 
Fetching 25 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 7972.14it/s]
Fetching 25 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 14250.83it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>system
You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.<|im_end|>
<|im_start|>user
Describe this image.<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant

So here we have two cats sleeping on a pink couch. It's a cozy scene, and both cats seem really relaxed. The cat on the left is lying on its side, stretched out with its paws tucked under its body. It has a striped pattern with shades of brown and black, and I can see a green collar around its neck. The other cat, on the right, is also on its side but with its paws extended forward. This one has a more tabby pattern with
==========
Prompt: 4.129 tokens-per-sec
Generation: 0.573 tokens-per-sec
So here we have two cats sleeping on a pink couch. It's a cozy scene, and both cats seem really relaxed. The cat on the left is lying on its side, stretched out with its paws tucked under its body. It has a striped pattern with shades of brown and black, and I can see a green collar around its neck. The other cat, on the right, is also on its side but with its paws extended forward. This one has a more tabby pattern with
Output generated in 183.76s
Memory used: 0.58 GB
--------------------------------------------------------------------------------

================================================================================
Running mlx-community/Qwen2-VL-7B-Instruct-8bit 
Fetching 12 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 6433.80it/s]
Fetching 12 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 27458.62it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Describe this image.<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant

The image features two cats sleeping on a pink couch. One cat is lying down with its head resting on the other cat, which is also sleeping. The couch has a red blanket covering it, and there are two remote controls placed on the couch near the cats. The scene depicts a cozy and comfortable setting for the two feline companions.
==========
Prompt: 32.923 tokens-per-sec
Generation: 53.537 tokens-per-sec
The image features two cats sleeping on a pink couch. One cat is lying down with its head resting on the other cat, which is also sleeping. The couch has a red blanket covering it, and there are two remote controls placed on the couch near the cats. The scene depicts a cozy and comfortable setting for the two feline companions.
Output generated in 2.64s
Memory used: 7.69 GB
--------------------------------------------------------------------------------

================================================================================
Running mlx-community/SmolVLM-Instruct-bf16 
Fetching 12 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 13793.27it/s]
Some kwargs in processor config are unused and will not have any effect: image_seq_len. 
Fetching 12 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 18801.51it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>User:<image>Describe this image.<end_of_utterance>
Assistant:
 Two cats are sleeping on a pink blanket. The cat on the left is curled up with its head facing down, while the cat on the right is stretched out with its eyes closed. Both cats have collars and are resting comfortably.
==========
Prompt: 10.793 tokens-per-sec
Generation: 105.234 tokens-per-sec
 Two cats are sleeping on a pink blanket. The cat on the left is curled up with its head facing down, while the cat on the right is stretched out with its eyes closed. Both cats have collars and are resting comfortably.
Output generated in 2.24s
Memory used: -13.31 GB
--------------------------------------------------------------------------------

================================================================================
Running mlx-community/deepseek-vl2-8bit 
Fetching 13 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 34379.54it/s]
Some kwargs in processor config are unused and will not have any effect: sft_format, image_std, image_token, candidate_resolutions, mask_prompt, ignore_id, image_mean, add_special_token, pad_token, normalize, downsample_ratio, patch_size. 
Add pad token = ['<|▁pad▁|>'] to the tokenizer
<|▁pad▁|>:2
Add image token = ['<image>'] to the tokenizer
<image>:128815
Added grounding-related tokens
Added chat tokens
Fetching 13 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 60183.17it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|User|>: <image>
Describe this image.

<|Assistant|>:
Failed to generate output for model at mlx-community/deepseek-vl2-8bit: 
================================================================================
Running mlx-community/dolphin-vision-72b-4bit 
Fetching 19 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 13898.11it/s]
The repository for /Users/x/.cache/huggingface/hub/models--mlx-community--dolphin-vision-72b-4bit/snapshots/82156979ae25603e5d1bbec346559fe27d279f22 contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//Users/jrp/.cache/huggingface/hub/models--mlx-community--dolphin-vision-72b-4bit/snapshots/82156979ae25603e5d1bbec346559fe27d279f22.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] Failed to load model at mlx-community/dolphin-vision-72b-4bit: The repository for /Users/jrp/.cache/huggingface/hub/models--mlx-community--dolphin-vision-72b-4bit/snapshots/82156979ae25603e5d1bbec346559fe27d279f22 contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//Users/jrp/.cache/huggingface/hub/models--mlx-community--dolphin-vision-72b-4bit/snapshots/82156979ae25603e5d1bbec346559fe27d279f22.
Please pass the argument `trust_remote_code=True` to allow custom code to be run.
================================================================================
Running mlx-community/idefics2-8b-chatty-8bit 
Fetching 12 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 20068.44it/s]
Failed to load model at mlx-community/idefics2-8b-chatty-8bit: Unsupported model type: idefics2_vision
================================================================================
Running mlx-community/llava-v1.6-34b-8bit 
Fetching 17 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 8466.30it/s]
Fetching 17 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 6025.79it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>user
<image>
Describe this image.<|im_end|>
<|im_start|>assistant

Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.50.
The image shows two cats lying on a pink surface, which appears to be a blanket or cushion. The cat on the left is stretched out with its body fully extended, while the cat on the right is curled up in a more compact position. Both cats are facing away from each other, and they seem to be resting or sleeping. There is a remote control placed between the two cats, suggesting that this scene might be taking place in a living room or similar setting. The image has a casual
==========
Prompt: 2.848 tokens-per-sec
Generation: 9.951 tokens-per-sec
The image shows two cats lying on a pink surface, which appears to be a blanket or cushion. The cat on the left is stretched out with its body fully extended, while the cat on the right is curled up in a more compact position. Both cats are facing away from each other, and they seem to be resting or sleeping. There is a remote control placed between the two cats, suggesting that this scene might be taking place in a living room or similar setting. The image has a casual
Output generated in 15.81s
Memory used: 1.38 GB
--------------------------------------------------------------------------------

================================================================================
Running mlx-community/llava-v1.6-mistral-7b-8bit 
Fetching 12 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 20068.44it/s]
Fetching 12 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 21272.89it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: [INST] <image>
Describe this image. [/INST]
In the tranquil setting of a cozy living room, two feline companions are captured in a moment of serene slumber. The first cat, a black and white tabby with striking blue eyes, is curled up on the left side of the pink blanket that adorns the couch. Its body is relaxed, with its head comfortably resting on the armrest of the couch.

On the right side of the blanket, a brown and black tabby cat is also
==========
Prompt: 15.255 tokens-per-sec
Generation: 44.796 tokens-per-sec
In the tranquil setting of a cozy living room, two feline companions are captured in a moment of serene slumber. The first cat, a black and white tabby with striking blue eyes, is curled up on the left side of the pink blanket that adorns the couch. Its body is relaxed, with its head comfortably resting on the armrest of the couch.

On the right side of the blanket, a brown and black tabby cat is also
Output generated in 3.85s
Memory used: 5.87 GB
--------------------------------------------------------------------------------

================================================================================
Running mlx-community/paligemma2-10b-ft-docci-448-6bit 
Fetching 8 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 65664.25it/s]
Fetching 8 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 23596.65it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <image>Describe this image.
A top-down view of two cats lying on a pink blanket with their heads down and eyes closed. The cat to the left is smaller than the one on the right, it has black and white stripes. The cat to the right is larger than the one on the left, it has brown and black stripes. There are two remote controls to the left of both cats on a red couch.
==========
Prompt: 2.239 tokens-per-sec
Generation: 35.980 tokens-per-sec
A top-down view of two cats lying on a pink blanket with their heads down and eyes closed. The cat to the left is smaller than the one on the right, it has black and white stripes. The cat to the right is larger than the one on the left, it has brown and black stripes. There are two remote controls to the left of both cats on a red couch.
Output generated in 4.96s
Memory used: 6.96 GB
--------------------------------------------------------------------------------

================================================================================
Running mlx-community/paligemma2-10b-ft-docci-448-bf16 
Fetching 10 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 18910.30it/s]
Fetching 10 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 25130.64it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <image>Describe this image.
A top-down view of two cats laying on a pink blanket with their heads down and eyes closed. The cat to the left is smaller than the one on the right, it has a black and white striped tail. The cat to the right is bigger than the one on the left, it has a black and white striped tail. There are two remote controls to the left of the cats, one on top of each other and both pointed towards the cat's heads.
==========
Prompt: 1.122 tokens-per-sec
Generation: 4.547 tokens-per-sec
A top-down view of two cats laying on a pink blanket with their heads down and eyes closed. The cat to the left is smaller than the one on the right, it has a black and white striped tail. The cat to the right is bigger than the one on the left, it has a black and white striped tail. There are two remote controls to the left of the cats, one on top of each other and both pointed towards the cat's heads.
Output generated in 25.93s
Memory used: -23.63 GB
--------------------------------------------------------------------------------

================================================================================
Running mlx-community/paligemma2-3b-ft-docci-448-bf16 
Fetching 8 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 5529.74it/s]
Fetching 8 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 11626.62it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <image>Describe this image.
A top-down view of two cats sleeping on a pink blanket with their heads down and arms stretched out. The cat on the right is brown and tan, and it has a white remote control between its legs. The cat on the left is black and gray, and it has a white remote control between its legs as well. The cat on the left is lying on the pink blanket, and its tail is sticking out to the right. The cat on the right is lying on the pink blanket as
==========
Prompt: 5.147 tokens-per-sec
Generation: 15.623 tokens-per-sec
A top-down view of two cats sleeping on a pink blanket with their heads down and arms stretched out. The cat on the right is brown and tan, and it has a white remote control between its legs. The cat on the left is black and gray, and it has a white remote control between its legs as well. The cat on the left is lying on the pink blanket, and its tail is sticking out to the right. The cat on the right is lying on the pink blanket as
Output generated in 7.88s
Memory used: 5.66 GB
--------------------------------------------------------------------------------

================================================================================
Running mlx-community/paligemma2-3b-pt-896-4bit 
Fetching 7 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 6591.86it/s]
Fetching 7 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 16786.81it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <image>Describe this image.
Cat
==========
Prompt: 1.377 tokens-per-sec
Generation: 93.992 tokens-per-sec
Cat
Output generated in 4.16s
Memory used: 1.65 GB
--------------------------------------------------------------------------------

================================================================================
Running mlx-community/pixtral-12b-8bit 
Fetching 11 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 27511.83it/s]
Fetching 11 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 25631.86it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <s>[INST][IMG]Describe this image.[/INST]
In the tranquil setting of a pink couch, two feline companions are enjoying a moment of rest. The cat on the left, with its brown and black stripes, is sprawled out in a relaxed pose. Its body is stretched out along the length of the couch, with its head comfortably resting on a white remote control. The cat's paws are stretched out in front of it, as if reaching for something just beyond its grasp.

On the right, another cat is curled up in a perfect
==========
Prompt: 2.727 tokens-per-sec
Generation: 27.218 tokens-per-sec
In the tranquil setting of a pink couch, two feline companions are enjoying a moment of rest. The cat on the left, with its brown and black stripes, is sprawled out in a relaxed pose. Its body is stretched out along the length of the couch, with its head comfortably resting on a white remote control. The cat's paws are stretched out in front of it, as if reaching for something just beyond its grasp.

On the right, another cat is curled up in a perfect
Output generated in 7.51s
Memory used: 10.45 GB
--------------------------------------------------------------------------------
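Since the aim is to keep track of performance across models, the per-model figures in a log like this can be collected into rows for comparison. This is a minimal sketch that assumes the line formats shown in these logs ("Running ...", "Generation: ... tokens-per-sec", "Output generated in ...s", "Memory used: ... GB"). The function name `summarize_log` is made up for illustration and is not part of mlx-vlm.

```python
import re

# Patterns for lines printed by the benchmarking script and by generate(verbose=True).
ANSI = re.compile(r"\x1b\[[0-9;]*m")  # bold/reset codes around the "Running" line
RUNNING = re.compile(r"^Running (\S+)")
GENERATION = re.compile(r"^Generation: (?:\d+ tokens, )?([\d.]+) tokens-per-sec")
ELAPSED = re.compile(r"^Output generated in ([\d.]+)s")
MEMORY = re.compile(r"^Memory used: (-?[\d.]+) GB")


def summarize_log(text):
    """Return one dict per model with generation speed, wall time and memory delta."""
    rows = []
    current = None
    for raw in text.splitlines():
        line = ANSI.sub("", raw)
        if m := RUNNING.match(line):
            current = {"model": m.group(1)}
            rows.append(current)
        elif current is None:
            continue  # ignore anything before the first "Running" line
        elif m := GENERATION.match(line):
            current["gen_tps"] = float(m.group(1))
        elif m := ELAPSED.match(line):
            current["seconds"] = float(m.group(1))
        elif m := MEMORY.match(line):
            current["mem_gb"] = float(m.group(1))
    return rows


# Two entries copied from the log, with progress bars and captions left out.
sample = """\
Running mlx-community/paligemma2-3b-pt-896-4bit
Prompt: 1.377 tokens-per-sec
Generation: 93.992 tokens-per-sec
Output generated in 4.16s
Memory used: 1.65 GB
Running mlx-community/pixtral-12b-8bit
Prompt: 2.727 tokens-per-sec
Generation: 27.218 tokens-per-sec
Output generated in 7.51s
Memory used: 10.45 GB
"""

for row in summarize_log(sample):
    print(row)
```

Models that fail to load simply end up with a row that has only the `model` key, which makes unsupported model types easy to spot.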

The script to reproduce this is:

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
import subprocess
import time
import psutil


output = subprocess.check_output(
    ["/opt/homebrew/Caskroom/miniconda/base/envs/mlx/bin/huggingface-cli", "scan-cache"]
)
lines = output.decode("utf-8").split("\n")[2:-4]

for line in lines:
    print(80 * "=")
    model_path = line.split()[0]
    print("\033[1mRunning", model_path, "\033[0m")

    process = psutil.Process()
    mem_before = process.memory_info().rss

    try:
        # Load the model
        model, tokenizer = load(model_path)
        config = load_config(model_path)
    except Exception as e:
        print(f"Failed to load model at {model_path}: {e}")
        continue

    # Prepare input
    image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
    prompt = "Describe this image."

    # Apply chat template
    formatted_prompt = apply_chat_template(
        tokenizer, config, prompt, num_images=len(image)
    )

    # Generate output
    try:
        start_time = time.time()
        output = generate(model, tokenizer, image, formatted_prompt, verbose=True)
        end_time = time.time()
        print(output)
    except Exception as e:
        print(f"Failed to generate output for model at {model_path}: {e}")
        continue

    mem_after = process.memory_info().rss
    print(f"Output generated in {end_time - start_time:.2f}s")
    print(f"Memory used: {(mem_after - mem_before) / (1024 * 1024 * 1024):.2f} GB")

    print(80 * "-", end="\n\n")
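The `[2:-4]` slice in the script depends on the exact number of header and footer lines that `huggingface-cli scan-cache` prints, and it also picks up cached datasets. A slightly more defensive sketch is to keep only rows whose second column is `model`. The helper name `model_repos` and the sample table below (including its sizes and paths) are made up for illustration, and the column layout is assumed from typical `scan-cache` output.

```python
def model_repos(scan_cache_text):
    """Return repo IDs whose REPO TYPE column is 'model'."""
    repos = []
    for line in scan_cache_text.splitlines():
        fields = line.split()
        # Data rows start with "<repo id> <repo type> ...". The header, the
        # dashed separator and the trailing summary lines never have "model"
        # as their second field, so they are skipped without counting lines.
        if len(fields) >= 2 and fields[1] == "model":
            repos.append(fields[0])
    return repos


# Illustrative text only; the values are not taken from a real cache.
sample = """\
REPO ID                          REPO TYPE SIZE ON DISK NB FILES LAST_ACCESSED LAST_MODIFIED REFS LOCAL PATH
-------------------------------- --------- ------------ -------- ------------- ------------- ---- ----------
HuggingFaceTB/SmolVLM-Instruct   model             4.5G       12 1 day ago     1 week ago    main /path/to/cache/a
some-org/some-dataset            dataset          64.9M        6 1 week ago    1 week ago    main /path/to/cache/b
mlx-community/pixtral-12b-8bit   model            13.5G       11 2 days ago    2 weeks ago   main /path/to/cache/c

Done in 0.0s. Scanned 3 repo(s) for a total of 18.1G.
"""

print(model_repos(sample))
```

In the script, the text would come from the existing `subprocess.check_output(...)` call, decoded as UTF-8.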

@Blaizzy
Owner

Blaizzy commented Dec 26, 2024

Hey @jrp2014

Thank you very much!

What is your proposed solution?

To clarify, both the need to trust the code and the deprecation warnings come from HF transformers.

Regarding the models that are slow, I think reducing the image size should address this.
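As a rough sketch of what reducing the image size could look like, the target dimensions can be computed so that the longer side stays under a limit while the aspect ratio is preserved. The function name `fit_within` and the 448-pixel limit are illustrative choices, not mlx-vlm settings.

```python
def fit_within(width, height, max_side):
    """Scale (width, height) down so the longer side is at most max_side.

    The aspect ratio is preserved and images that already fit are left alone.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Guard against a dimension rounding down to zero for extreme ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))


# Example: a 640x480 input limited to 448 pixels on its longer side.
print(fit_within(640, 480, 448))  # (448, 336)
```

The resulting tuple could then be passed to something like Pillow's `Image.resize` before the image is handed to the model.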

@jrp2014
Author

jrp2014 commented Dec 27, 2024

I think that the main thing is to document the capabilities of the different models. Some are very fast, but don't produce very detailed results. Others are slow, but worth waiting for. Others are a bit too breezy for my taste. And some produce only a description, without keywords or a caption. I don't know whether that could be changed with, for example, a different system prompt.
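One way to test the prompt idea would be to ask for a fixed layout and check whether each model follows it. This is only a sketch: the prompt wording, the `Caption:` / `Keywords:` layout and the helper name are all assumptions for illustration, and there is no guarantee that any particular model will comply.

```python
KEYWORD_PROMPT = (
    "Describe this image in one sentence, then list up to ten keywords.\n"
    "Use exactly this layout:\n"
    "Caption: <one sentence>\n"
    "Keywords: <comma-separated keywords>"
)


def parse_caption_and_keywords(reply):
    """Extract the caption and keyword list from a reply in the layout above.

    Missing parts come back as an empty string / empty list, so the caller
    can tell when a model ignored the requested layout.
    """
    caption, keywords = "", []
    for line in reply.splitlines():
        label, _, value = line.partition(":")
        label = label.strip().lower()
        if label == "caption":
            caption = value.strip()
        elif label == "keywords":
            keywords = [k.strip() for k in value.split(",") if k.strip()]
    return caption, keywords


# Hypothetical reply, written by hand rather than produced by a model.
reply = "Caption: Two cats sleeping on a pink blanket.\nKeywords: cats, blanket, remote control, couch"
print(parse_caption_and_keywords(reply))
```

`KEYWORD_PROMPT` would take the place of "Describe this image." in the script. A model that returns `("", [])` here is one that only produces free-form descriptions.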

The code-trust setting could be passed through and exposed as a parameter to load/config.

Several model types seem to be unsupported. I have no idea how much work it would be to support them / their families.

It's not clear to me why some models are limited by image size, others less so.

The following failure seems to come from the mlx-vlm code itself.

Running mlx-community/deepseek-vl2-8bit 
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 18126.98it/s]
Some kwargs in processor config are unused and will not have any effect: image_token, normalize, candidate_resolutions, sft_format, mask_prompt, image_mean, patch_size, pad_token, add_special_token, downsample_ratio, image_std, ignore_id. 
Add pad token = ['<|▁pad▁|>'] to the tokenizer
<|▁pad▁|>:2
Add image token = ['<image>'] to the tokenizer
<image>:128815
Added grounding-related tokens
Added chat tokens
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 26064.03it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|User|>: <image>
Describe this image.

<|Assistant|>:
Traceback (most recent call last):
  File "/Users/jrp/Documents/AI/mlx/scripts/vlm/check_models.py", line 39, in <module>
    output = generate(model, tokenizer, image, formatted_prompt, verbose=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/mlx_vlm/utils.py", line 1258, in generate
    for (token, prob), n in zip(generator, range(max_tokens)):
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/mlx_vlm/utils.py", line 1058, in generate_step
    outputs = model(input_ids, pixel_values, cache=cache, mask=mask, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/mlx_vlm/models/deepseek_vl_v2/deepseek_vl_v2.py", line 436, in __call__
    input_embeddings = self.get_input_embeddings(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/mlx_vlm/models/deepseek_vl_v2/deepseek_vl_v2.py", line 395, in get_input_embeddings
    assert total_tiles.shape[0] == sum(batch_num_tiles)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

@Blaizzy
Owner

Blaizzy commented Dec 27, 2024

Please share the command you used and the version of mlx-vlm.

@jrp2014
Author

jrp2014 commented Dec 27, 2024

It's just the script above, but without the try/except around the generate call.

mlx                       0.21.1
mlx-lm                    0.20.4
mlx-vlm                   0.1.6
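For reports like this, the versions can also be printed from within the script. A minimal sketch using the standard library's `importlib.metadata` follows. The helper name `installed_version` is made up, and the package list is just the set relevant to this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string for a distribution, or 'not installed'."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"


for name in ("mlx", "mlx-lm", "mlx-vlm", "transformers"):
    print(f"{name:<14}{installed_version(name)}")
```

This looks the version up by distribution name, so it prints a plain version string rather than a module object.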

@Blaizzy
Owner

Blaizzy commented Dec 27, 2024

I ran your script and it worked.

Unfortunately, I can't replicate this issue:

Code:

model_path = "mlx-community/deepseek-vl2-8bit"

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
import subprocess
import time
import psutil


print("\033[1mRunning", model_path, "\033[0m")

process = psutil.Process()
mem_before = process.memory_info().rss

try:
    # Load the model
    model, tokenizer = load(model_path)
    config = load_config(model_path)
except Exception as e:
    print(f"Failed to load model at {model_path}: {e}")
    raise SystemExit(1)  # nothing below can run without a loaded model

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(
    tokenizer, config, prompt, num_images=len(image)
)

# Generate output
try:
    start_time = time.time()
    output = generate(model, tokenizer, image, formatted_prompt, verbose=True)
    end_time = time.time()
    print(output)
except Exception as e:
    print(f"Failed to generate output for model at {model_path}: {e}")
    raise SystemExit(1)  # end_time is undefined if generation failed

mem_after = process.memory_info().rss
print(f"Output generated in {end_time - start_time:.2f}s")
print(f"Memory used: {(mem_after - mem_before) / (1024 * 1024 * 1024):.2f} GB")

print(80 * "-", end="\n\n")

Output:

==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|User|>: <image>
Describe this image.

<|Assistant|>:
Two tabby cats lying on what appears to be a red couch or cushioned surface covered by a pink blanket that has fringed edges. The cat closest to the top of the frame is lying on its side facing leftward; it appears relaxed but alert as if observing something out of view. Its body language suggests relaxation yet attentiveness. Next to this first cat lies another tabby cat facing rightward towards the camera's perspective; only part of his face can be seen peeking over the
==========
Prompt: 2.412 tokens-per-sec
Generation: 34.423 tokens-per-sec
Output generated in 8.68s
Memory used: 5.40 GB
--------------------------------------------------------------------------------

@Blaizzy
Owner

Blaizzy commented Dec 27, 2024

Are you running transformers==4.47.1?

@jrp2014
Author

jrp2014 commented Dec 27, 2024

Very curious. That works for me, too.

Running mlx-community/deepseek-vl2-8bit 
Fetching 13 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 91028.30it/s]
Some kwargs in processor config are unused and will not have any effect: mask_prompt, downsample_ratio, add_special_token, pad_token, image_mean, normalize, image_std, image_token, patch_size, ignore_id, sft_format, candidate_resolutions. 
Add pad token = ['<|▁pad▁|>'] to the tokenizer
<|▁pad▁|>:2
Add image token = ['<image>'] to the tokenizer
<image>:128815
Added grounding-related tokens
Added chat tokens
Fetching 13 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 27194.99it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|User|>: <image>
Describe this image.

<|Assistant|>:
Two tabby cats lying on what appears to be a red couch or cushioned surface covered by a pink blanket that has fringed edges. The cat closest to the top of the frame is lying on its side facing leftward; it appears relaxed but alert as if observing something out of view. Its body language suggests relaxation yet attentiveness. Next to this first cat lies another tabby cat facing rightward towards the camera's perspective; only part of his face can be seen peeking over the
==========
Prompt: 3.949 tokens-per-sec
Generation: 57.769 tokens-per-sec
Two tabby cats lying on what appears to be a red couch or cushioned surface covered by a pink blanket that has fringed edges. The cat closest to the top of the frame is lying on its side facing leftward; it appears relaxed but alert as if observing something out of view. Its body language suggests relaxation yet attentiveness. Next to this first cat lies another tabby cat facing rightward towards the camera's perspective; only part of his face can be seen peeking over the
Output generated in 5.39s
Memory used: 27.57 GB
--------------------------------------------------------------------------------

@jrp2014
Author

jrp2014 commented Dec 31, 2024

Version 18 seems to work more smoothly and faster. There are still a couple of models that produce strange results, but that is probably a model issue rather than an mlx-vlm issue.

> python check_models.py
mlx-vlm version: <module 'mlx_vlm.version' from '/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/mlx_vlm/version.py'>
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running HuggingFaceTB/SmolVLM-Instruct 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 78398.21it/s]
Some kwargs in processor config are unused and will not have any effect: image_seq_len. 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 27900.03it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>User:<image>Describe this image.<end_of_utterance>
Assistant:
 Two cats are sleeping on a pink blanket.
==========
Prompt: 1195 tokens, 1100.103 tokens-per-sec
Generation: 10 tokens, 131.207 tokens-per-sec
Peak memory: 6.007 GB
 Two cats are sleeping on a pink blanket.
Output generated in 1.98s
Memory used: 4.44 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running OpenGVLab/InternVL2_5-8B 
Fetching 21 files: 100%|█████████████████████████████████████████████████████████████| 21/21 [00:00<00:00, 19499.75it/s]
ERROR:root:Model type internvl_chat not supported.
Failed to load model at OpenGVLab/InternVL2_5-8B: Model type internvl_chat not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running cognitivecomputations/dolphin-2.9.2-qwen2-72b 
Fetching 40 files: 100%|██████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 8632.03it/s]
ERROR:root:Model type qwen2 not supported.
Failed to load model at cognitivecomputations/dolphin-2.9.2-qwen2-72b: Model type qwen2 not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running distilbert/distilbert-base-uncased-finetuned-sst-2-english 
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 19301.91it/s]
ERROR:root:Model type distilbert not supported.
Failed to load model at distilbert/distilbert-base-uncased-finetuned-sst-2-english: Model type distilbert not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running google/siglip-so400m-patch14-384 
Fetching 6 files: 100%|█████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 6643.56it/s]
ERROR:root:Model type siglip not supported.
Failed to load model at google/siglip-so400m-patch14-384: Model type siglip not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running meta-llama/Llama-3.2-11B-Vision-Instruct 
Fetching 15 files: 100%|█████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 12155.05it/s]
Fetching 15 files: 100%|█████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 16705.94it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Describe this image.<|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>


The image shows two cats lying on a pink blanket, with two remote controls placed on the couch behind them. The cat on the left is a small, fluffy kitten with a long tail and a green collar, while the cat on the right is a larger, tabby cat with a short tail. Both cats are lying on their backs, with their paws stretched out to the sides. The remote controls are placed on the couch behind the cats, with the one on the left being a white remote control and the one on the right being a silver remote control. The background of the image is a pink blanket that the cats are lying on.
==========
Prompt: 16 tokens, 4.755 tokens-per-sec
Generation: 130 tokens, 3.722 tokens-per-sec
Peak memory: 31.532 GB
The image shows two cats lying on a pink blanket, with two remote controls placed on the couch behind them. The cat on the left is a small, fluffy kitten with a long tail and a green collar, while the cat on the right is a larger, tabby cat with a short tail. Both cats are lying on their backs, with their paws stretched out to the sides. The remote controls are placed on the couch behind the cats, with the one on the left being a white remote control and the one on the right being a silver remote control. The background of the image is a pink blanket that the cats are lying on.
Output generated in 38.91s
Memory used: 17.99 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running microsoft/Florence-2-large-ft 
preprocessor_config.json: 100%|████████████████████████████████████████████████████████| 806/806 [00:00<00:00, 9.80MB/s]
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████| 34.0/34.0 [00:00<00:00, 37.2kB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████| 51.0/51.0 [00:00<00:00, 60.2kB/s]
config.json: 100%|█████████████████████████████████████████████████████████████████| 2.44k/2.44k [00:00<00:00, 34.0MB/s]
processing_florence2.py: 100%|█████████████████████████████████████████████████████| 46.4k/46.4k [00:00<00:00, 2.65MB/s]
modeling_florence2.py: 100%|█████████████████████████████████████████████████████████| 127k/127k [00:00<00:00, 1.91MB/s]
vocab.json: 100%|██████████████████████████████████████████████████████████████████| 1.10M/1.10M [00:00<00:00, 4.89MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 4.97MB/s]
Fetching 9 files: 100%|███████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 15.36it/s]
ERROR:root:No safetensors found in /Users/jrp/.cache/huggingface/hub/models--microsoft--Florence-2-large-ft/snapshots/bb44b80c15e943b1bf7cec6e076359cec6e40178
Failed to load model at microsoft/Florence-2-large-ft: 
No safetensors found in /Users/jrp/.cache/huggingface/hub/models--microsoft--Florence-2-large-ft/snapshots/bb44b80c15e943b1bf7cec6e076359cec6e40178
Create safetensors using the following code:

from transformers import AutoModelForCausalLM, AutoProcessor

model_id= "<huggingface_model_id>"
model = AutoModelForCausalLM.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

model.save_pretrained("<local_dir>")
processor.save_pretrained("<local_dir>")

Then use the <local_dir> as the --hf-path in the convert script.

python -m mlx_vlm.convert --hf-path <local_dir> --mlx-path <mlx_dir>

        
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running microsoft/Phi-3.5-mini-instruct 
Fetching 13 files: 100%|██████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 8301.76it/s]
ERROR:root:Model type phi3 not supported.
Failed to load model at microsoft/Phi-3.5-mini-instruct: Model type phi3 not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running microsoft/Phi-3.5-vision-instruct 
Fetching 14 files: 100%|██████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 9180.78it/s]
/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:524: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
Fetching 14 files: 100%|█████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 21976.14it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|user|>
<|image_1|>Describe this image.<|end|>
<|assistant|>

The image shows two cats lying on a pink couch. The cat on the left is a tabby with a mix of dark and light stripes, while the cat on the right is a solid grey. Both cats have their eyes closed and appear to be sleeping. There are two remote controls on the couch, one blue and one white. The couch has a red cushion on top.<|end|>
==========
Prompt: 771 tokens, 892.031 tokens-per-sec
Generation: 88 tokens, 10.103 tokens-per-sec
Peak memory: 31.532 GB
The image shows two cats lying on a pink couch. The cat on the left is a tabby with a mix of dark and light stripes, while the cat on the right is a solid grey. Both cats have their eyes closed and appear to be sleeping. There are two remote controls on the couch, one blue and one white. The couch has a red cushion on top.<|end|>
Output generated in 10.19s
Memory used: 7.47 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mistral-community/pixtral-12b 
Fetching 15 files: 100%|█████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 36303.84it/s]
Failed to load model at mistral-community/pixtral-12b: Unsupported model type: pixtral
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Florence-2-large-ft-bf16 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 11541.31it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 14708.25it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: Describe this image.
<s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s>
==========
Prompt: 7 tokens, 24.534 tokens-per-sec
Generation: 256 tokens, 170.670 tokens-per-sec
Peak memory: 31.532 GB
<s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s>
Output generated in 2.44s
Memory used: 1.57 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Llama-3.2-11B-Vision-Instruct-8bit 
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 14768.68it/s]
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 27832.14it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Describe this image.<|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>


The image shows two cats lying on a pink blanket, with two remote controls placed on the couch behind them. The cat on the left is a small, fluffy kitten with a long tail and a green collar. The cat on the right is a larger, tabby cat with a short tail and no collar. Both cats are lying on their backs, with their paws stretched out to the sides. The remote controls are placed on the couch behind the cats, with the one on the left being a white remote control and the one on the right being a gray remote control. The background of the image is a pink blanket that the cats are lying on. Overall, the image appears to be a casual and relaxed scene, with the cats enjoying a peaceful moment together.
==========
Prompt: 15 tokens, 4.637 tokens-per-sec
Generation: 154 tokens, 9.136 tokens-per-sec
Peak memory: 31.532 GB
The image shows two cats lying on a pink blanket, with two remote controls placed on the couch behind them. The cat on the left is a small, fluffy kitten with a long tail and a green collar. The cat on the right is a larger, tabby cat with a short tail and no collar. Both cats are lying on their backs, with their paws stretched out to the sides. The remote controls are placed on the couch behind the cats, with the one on the left being a white remote control and the one on the right being a gray remote control. The background of the image is a pink blanket that the cats are lying on. Overall, the image appears to be a casual and relaxed scene, with the cats enjoying a peaceful moment together.
Output generated in 20.67s
Memory used: 10.69 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Llama-3.3-70B-Instruct-8bit 
Fetching 20 files: 100%|█████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 17407.36it/s]
ERROR:root:Model type llama not supported.
Failed to load model at mlx-community/Llama-3.3-70B-Instruct-8bit: Model type llama not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Molmo-7B-D-0924-8bit 
Fetching 16 files: 100%|██████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 5921.54it/s]
Fetching 16 files: 100%|█████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 15470.00it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: Describe this image.
 In this image, two cats are peacefully sleeping on a red couch, which is covered with a pink blanket. The cat on the left is a gray and black striped feline with a white belly and a black tail, wearing a green collar. This cat is lying on its side with its front paws stretched out and its back legs hanging off the edge of the couch. The cat on the right is a larger, chunkier cat with a mix of brown, black, and white fur, also with a white belly. This cat is lying on its side with its back legs hanging off the couch and its front legs resting on the pink blanket. Both cats are nestled close to each other, with a white remote control positioned between them. The scene is cozy and serene, capturing a moment of tranquility as the two cats rest comfortably on the couch.
==========
Prompt: 749 tokens, 76.519 tokens-per-sec
Generation: 172 tokens, 41.105 tokens-per-sec
Peak memory: 31.532 GB
 In this image, two cats are peacefully sleeping on a red couch, which is covered with a pink blanket. The cat on the left is a gray and black striped feline with a white belly and a black tail, wearing a green collar. This cat is lying on its side with its front paws stretched out and its back legs hanging off the edge of the couch. The cat on the right is a larger, chunkier cat with a mix of brown, black, and white fur, also with a white belly. This cat is lying on its side with its back legs hanging off the couch and its front legs resting on the pink blanket. Both cats are nestled close to each other, with a white remote control positioned between them. The scene is cozy and serene, capturing a moment of tranquility as the two cats rest comfortably on the couch.
Output generated in 14.60s
Memory used: 8.26 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Molmo-7B-D-0924-bf16 
Fetching 18 files: 100%|█████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 17806.01it/s]
Failed to load model at mlx-community/Molmo-7B-D-0924-bf16: Expected shape (1024, 37888) but received shape (1024, 588) for parameter vision_tower.image_vit.patch_embedding.weight
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Phi-3.5-vision-instruct-bf16 
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 47290.50it/s]
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 14824.89it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|user|>
<|image_1|>Describe this image.<|end|>
<|assistant|>

The image shows two cats lying on a pink couch. The cat on the left is a tabby with a mix of dark and light stripes, while the cat on the right is a solid grey. Both cats have their eyes closed and appear to be sleeping. There are two remote controls on the couch, one blue and one white. The couch has a red cushion on top.<|end|>
==========
Prompt: 771 tokens, 898.032 tokens-per-sec
Generation: 88 tokens, 10.040 tokens-per-sec
Peak memory: 31.532 GB
The image shows two cats lying on a pink couch. The cat on the left is a tabby with a mix of dark and light stripes, while the cat on the right is a solid grey. Both cats have their eyes closed and appear to be sleeping. There are two remote controls on the couch, one blue and one white. The couch has a red cushion on top.<|end|>
Output generated in 10.25s
Memory used: 7.81 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/QVQ-72B-Preview-8bit 
Fetching 25 files: 100%|█████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 17546.45it/s]
Fetching 25 files: 100%|█████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 14133.66it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>system
You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.<|im_end|>
<|im_start|>user
Describe this image.<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant

So here we have two cats sleeping on a pink couch. It's a cozy scene, and both cats seem really relaxed. The cat on the left is lying on its side, stretched out with its paws tucked under its body. It has a striped pattern with shades of brown and black, and I can see a green collar around its neck, which might have a tag or something on it.

The cat on the right is also lying on its side but is positioned a bit differently. Its paws are stretched out in front of it, and it has a similar striped pattern, maybe a bit more orange in tone. Both cats are completely at ease, and their fur looks soft and well-groomed.

Between them, there are two remote controls resting on the couch. One is longer and white with various buttons, and the other is smaller and gray with blue buttons. These remotes are probably for the TV or some other electronic devices.

The couch itself is a vibrant pink color, which stands out nicely against the cats' fur. The fabric looks smooth and comfortable, perfect for napping. The way the cats are spread out suggests that they feel very safe and secure in this environment.

Overall, the scene is peaceful and homely, with the cats clearly enjoying their
==========
Prompt: 433 tokens, 52.096 tokens-per-sec
Generation: 256 tokens, 0.559 tokens-per-sec
Peak memory: 78.649 GB
So here we have two cats sleeping on a pink couch. It's a cozy scene, and both cats seem really relaxed. The cat on the left is lying on its side, stretched out with its paws tucked under its body. It has a striped pattern with shades of brown and black, and I can see a green collar around its neck, which might have a tag or something on it.

The cat on the right is also lying on its side but is positioned a bit differently. Its paws are stretched out in front of it, and it has a similar striped pattern, maybe a bit more orange in tone. Both cats are completely at ease, and their fur looks soft and well-groomed.

Between them, there are two remote controls resting on the couch. One is longer and white with various buttons, and the other is smaller and gray with blue buttons. These remotes are probably for the TV or some other electronic devices.

The couch itself is a vibrant pink color, which stands out nicely against the cats' fur. The fabric looks smooth and comfortable, perfect for napping. The way the cats are spread out suggests that they feel very safe and secure in this environment.

Overall, the scene is peaceful and homely, with the cats clearly enjoying their
Output generated in 466.73s
Memory used: 7.42 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Qwen2-VL-7B-Instruct-8bit 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 29416.51it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 18020.64it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Describe this image.<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant

The image features two cats sleeping on a pink couch. One cat is lying on its back, while the other cat is curled up in a ball. There are two remote controls on the couch, one near each cat. The scene depicts a cozy and comfortable setting for the cats to rest.
==========
Prompt: 416 tokens, 561.236 tokens-per-sec
Generation: 59 tokens, 59.476 tokens-per-sec
Peak memory: 78.649 GB
The image features two cats sleeping on a pink couch. One cat is lying on its back, while the other cat is curled up in a ball. There are two remote controls on the couch, one near each cat. The scene depicts a cozy and comfortable setting for the cats to rest.
Output generated in 2.33s
Memory used: 8.28 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/SmolVLM-Instruct-bf16 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 26658.71it/s]
Some kwargs in processor config are unused and will not have any effect: image_seq_len. 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 33802.32it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>User:<image>Describe this image.<end_of_utterance>
Assistant:
 Two cats are sleeping on a pink blanket.
==========
Prompt: 1195 tokens, 1107.044 tokens-per-sec
Generation: 10 tokens, 130.926 tokens-per-sec
Peak memory: 78.649 GB
 Two cats are sleeping on a pink blanket.
Output generated in 1.78s
Memory used: 3.77 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/deepseek-vl2-8bit 
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 18205.66it/s]
Some kwargs in processor config are unused and will not have any effect: patch_size, image_std, normalize, downsample_ratio, add_special_token, image_token, sft_format, mask_prompt, pad_token, image_mean, candidate_resolutions, ignore_id. 
Add pad token = ['<|▁pad▁|>'] to the tokenizer
<|▁pad▁|>:2
Add image token = ['<image>'] to the tokenizer
<image>:128815
Added grounding-related tokens
Added chat tokens
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 20803.49it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|User|>: <image>
Describe this image.

<|Assistant|>:
Failed to generate output for model at mlx-community/deepseek-vl2-8bit: 'DeepseekVLV2Processor' object has no attribute 'eos_token_id'
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/dolphin-vision-72b-4bit 
Fetching 19 files: 100%|█████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 23874.11it/s]
Failed to load model at mlx-community/dolphin-vision-72b-4bit: TextConfig.__init__() missing 7 required positional arguments: 'model_type', 'hidden_size', 'num_hidden_layers', 'intermediate_size', 'num_attention_heads', 'rms_norm_eps', and 'vocab_size'
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/idefics2-8b-chatty-8bit 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 30859.38it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 19144.79it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: User: Describe this image.<image><end_of_utterance>
Assistant:
In the tranquil setting of a pink couch, two feline companions are captured in a moment of serene slumber. The cat on the left, a striking tabby with a coat of brown and black stripes, is curled up in a peaceful slumber. Its head is gently resting on the arm of the couch, a picture of contentment.

On the right, a gray and white cat is also enjoying the comfort of the couch. Its body is stretched out in a relaxed pose, with its head resting on the arm of the couch as well. The two cats, despite their different fur colors, share a common bond in their choice of resting spot and their peaceful demeanor.

The image is a beautiful snapshot of these two cats, their colors contrasting yet complementing each other, as they enjoy a shared moment of rest on the pink couch.<end_of_utterance>
==========
Prompt: 79 tokens, 150.162 tokens-per-sec
Generation: 182 tokens, 48.290 tokens-per-sec
Peak memory: 78.649 GB
In the tranquil setting of a pink couch, two feline companions are captured in a moment of serene slumber. The cat on the left, a striking tabby with a coat of brown and black stripes, is curled up in a peaceful slumber. Its head is gently resting on the arm of the couch, a picture of contentment.

On the right, a gray and white cat is also enjoying the comfort of the couch. Its body is stretched out in a relaxed pose, with its head resting on the arm of the couch as well. The two cats, despite their different fur colors, share a common bond in their choice of resting spot and their peaceful demeanor.

The image is a beautiful snapshot of these two cats, their colors contrasting yet complementing each other, as they enjoy a shared moment of rest on the pink couch.<end_of_utterance>
Output generated in 4.82s
Memory used: 8.36 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/llava-v1.6-34b-8bit 
Fetching 17 files: 100%|█████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 21438.11it/s]
Fetching 17 files: 100%|█████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 18438.88it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>user
<image>
Describe this image.<|im_end|>
<|im_start|>assistant

Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.50.
The image shows two cats lying on a pink surface, which appears to be a blanket or a piece of furniture. The cat on the left is a tabby with a mix of dark and light stripes, and it is lying on its side with its head resting on its front paws. The cat on the right is also a tabby, with a similar pattern of stripes, and it is lying on its stomach with its head turned to the side. Between the two cats, there is a remote control with a white and gray color scheme. The background is not clearly visible, but it seems to be an indoor setting with a red surface, possibly a couch or a chair. The image has a casual, candid quality, capturing a moment of rest for the cats.
==========
Prompt: 15 tokens, 3.777 tokens-per-sec
Generation: 155 tokens, 10.029 tokens-per-sec
Peak memory: 78.649 GB
The image shows two cats lying on a pink surface, which appears to be a blanket or a piece of furniture. The cat on the left is a tabby with a mix of dark and light stripes, and it is lying on its side with its head resting on its front paws. The cat on the right is also a tabby, with a similar pattern of stripes, and it is lying on its stomach with its head turned to the side. Between the two cats, there is a remote control with a white and gray color scheme. The background is not clearly visible, but it seems to be an indoor setting with a red surface, possibly a couch or a chair. The image has a casual, candid quality, capturing a moment of rest for the cats.
Output generated in 20.00s
Memory used: 34.31 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/llava-v1.6-mistral-7b-8bit 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 24733.00it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 20246.04it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: [INST] <image>
Describe this image. [/INST]
In the image, there are two cats lounging on a pink blanket that is spread out on a red couch. The cat on the left is a gray and white tabby, while the cat on the right is a brown and black tabby. Both cats are lying on their sides, with their heads resting on the arm of the couch. The tabby cat on the right is facing towards the left side of the image, while the tabby cat on the left is facing towards the right side of the image. In the background, there is a remote control resting on the arm of the couch. The cats appear to be relaxed and comfortable in their environment. 
==========
Prompt: 16 tokens, 15.044 tokens-per-sec
Generation: 136 tokens, 47.442 tokens-per-sec
Peak memory: 78.649 GB
In the image, there are two cats lounging on a pink blanket that is spread out on a red couch. The cat on the left is a gray and white tabby, while the cat on the right is a brown and black tabby. Both cats are lying on their sides, with their heads resting on the arm of the couch. The tabby cat on the right is facing towards the left side of the image, while the tabby cat on the left is facing towards the right side of the image. In the background, there is a remote control resting on the arm of the couch. The cats appear to be relaxed and comfortable in their environment. 
Output generated in 4.47s
Memory used: 6.42 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-10b-ft-docci-448-6bit 
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 21169.99it/s]
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 35734.22it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <image>Describe this image.
A top-down view of two cats lying on a pink blanket. The cat on the left is lying on its side, and its head is facing to the right. Its body is facing to the left. Its tail is sticking out to the left. The cat on the right is lying on its side, and its head is facing down. Its body is facing to the right. Its tail is sticking out to the right. Two remote controls are on the pink blanket, one on each side of the cats. The one on the left is gray, and the one on the right is white.
==========
Prompt: 1030 tokens, 474.817 tokens-per-sec
Generation: 121 tokens, 38.998 tokens-per-sec
Peak memory: 78.649 GB
A top-down view of two cats lying on a pink blanket. The cat on the left is lying on its side, and its head is facing to the right. Its body is facing to the left. Its tail is sticking out to the left. The cat on the right is lying on its side, and its head is facing down. Its body is facing to the right. Its tail is sticking out to the right. Two remote controls are on the pink blanket, one on each side of the cats. The one on the left is gray, and the one on the right is white.
Output generated in 5.80s
Memory used: 7.74 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-10b-ft-docci-448-bf16 
Fetching 10 files: 100%|██████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 9969.82it/s]
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 74104.31it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <image>Describe this image.
A top-down view of two cats laying on a pink blanket. The cat on the left is a gray tabby cat with black stripes and a black tail. It is laying on its side with its head on the pink blanket and its body facing the left side of the image. Its front legs are stretched out in front of it, and its back legs are curled up. The cat on the right is a brown tabby cat with black stripes. It is laying on its side with its head on the pink blanket and its body facing the right side of the image. Its front legs are stretched out in front of it, and its back legs are curled up. There is a gray remote control on the left side of the image and a gray remote control on the right side of the image.
==========
Prompt: 1030 tokens, 463.995 tokens-per-sec
Generation: 159 tokens, 4.600 tokens-per-sec
Peak memory: 78.649 GB
A top-down view of two cats laying on a pink blanket. The cat on the left is a gray tabby cat with black stripes and a black tail. It is laying on its side with its head on the pink blanket and its body facing the left side of the image. Its front legs are stretched out in front of it, and its back legs are curled up. The cat on the right is a brown tabby cat with black stripes. It is laying on its side with its head on the pink blanket and its body facing the right side of the image. Its front legs are stretched out in front of it, and its back legs are curled up. There is a gray remote control on the left side of the image and a gray remote control on the right side of the image.
Output generated in 37.35s
Memory used: 18.02 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-3b-ft-docci-448-bf16 
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 65027.97it/s]
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 16802.42it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <image>Describe this image.
A top-down view of two cats sleeping on a pink blanket. The cat on the left is a gray and black tabby cat, and it is lying on its side with its head facing the right. Its tail is sticking out to the left. Its front paws are hanging off the blanket on the left. A white remote is between the cats. The cat on the right is lying on its side, and its head is facing the left. Its tail is sticking out to the left. A white remote is between the cats' bodies.
==========
Prompt: 1030 tokens, 1312.847 tokens-per-sec
Generation: 109 tokens, 16.107 tokens-per-sec
Peak memory: 78.649 GB
A top-down view of two cats sleeping on a pink blanket. The cat on the left is a gray and black tabby cat, and it is lying on its side with its head facing the right. Its tail is sticking out to the left. Its front paws are hanging off the blanket on the left. A white remote is between the cats. The cat on the right is lying on its side, and its head is facing the left. Its tail is sticking out to the left. A white remote is between the cats' bodies.
Output generated in 8.09s
Memory used: 5.61 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-3b-pt-896-4bit 
Fetching 7 files: 100%|█████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 6680.35it/s]
Fetching 7 files: 100%|████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 11455.38it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <image>Describe this image.
Cat.
==========
Prompt: 4102 tokens, 1175.475 tokens-per-sec
Generation: 3 tokens, 65.632 tokens-per-sec
Peak memory: 78.649 GB
Cat.
Output generated in 4.13s
Memory used: 1.64 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/pixtral-12b-8bit 
Fetching 11 files: 100%|█████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 30373.50it/s]
Fetching 11 files: 100%|█████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 19964.23it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <s>[INST][IMG]Describe this image.[/INST]
In the image, there are two cats resting on a pink sofa. One cat is positioned towards the left side, stretching out with its body relaxed and one paw extended upwards. The other cat is on the right side, lying down with its head resting on the sofa. There are two remote controls in the scene. One remote control is placed towards the top left corner, while the other is situated more towards the center-right of the image.
==========
Prompt: 1238 tokens, 420.970 tokens-per-sec
Generation: 89 tokens, 27.962 tokens-per-sec
Peak memory: 78.649 GB
In the image, there are two cats resting on a pink sofa. One cat is positioned towards the left side, stretching out with its body relaxed and one paw extended upwards. The other cat is on the right side, lying down with its head resting on the sofa. There are two remote controls in the scene. One remote control is placed towards the top left corner, while the other is situated more towards the center-right of the image.
Output generated in 6.70s
Memory used: 12.57 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
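For keeping track of these runs over time, the figures can be pulled out of a transcript like the one above mechanically. This is a minimal sketch, assuming the log lines keep exactly the shape shown here (`Running <model>`, `Generation: N tokens, X tokens-per-sec`, `Output generated in Ns`, `Memory used: N GB`, `Failed to ... at <model>: <message>`). `RunResult`, `parse_log` and `SAMPLE` are hypothetical names used for illustration and are not part of mlx-vlm.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Patterns mirror the transcript lines above. This is an assumption: another
# mlx-vlm version or driver script may print these lines differently.
RUNNING = re.compile(r"^Running (\S+)\s*$")
GENERATION = re.compile(r"^Generation: (\d+) tokens, ([\d.]+) tokens-per-sec")
ELAPSED = re.compile(r"^Output generated in ([\d.]+)s")
MEMORY = re.compile(r"^Memory used: ([\d.]+) GB")
FAILED = re.compile(
    r"^Failed to (?:load model|generate output for model) at (\S+): ?(.*)$"
)


@dataclass
class RunResult:
    model: str
    gen_tokens: Optional[int] = None
    gen_tps: Optional[float] = None
    seconds: Optional[float] = None
    memory_gb: Optional[float] = None
    error: Optional[str] = None  # None means no failure line was seen


def parse_log(text: str) -> List[RunResult]:
    """Collect one RunResult per 'Running <model>' line in the transcript."""
    results: List[RunResult] = []
    current: Optional[RunResult] = None
    for line in text.splitlines():
        m = RUNNING.match(line)
        if m:
            current = RunResult(model=m.group(1))
            results.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first 'Running' line
        m = GENERATION.match(line)
        if m:
            current.gen_tokens = int(m.group(1))
            current.gen_tps = float(m.group(2))
            continue
        m = ELAPSED.match(line)
        if m:
            current.seconds = float(m.group(1))
            continue
        m = MEMORY.match(line)
        if m:
            current.memory_gb = float(m.group(1))
            continue
        m = FAILED.match(line)
        if m:
            current.error = m.group(2)
    return results


# A few lines copied from the transcript above, used as a smoke test.
SAMPLE = """\
Running mlx-community/SmolVLM-Instruct-bf16 
Prompt: <|im_start|>User:<image>Describe this image.<end_of_utterance>
Prompt: 1195 tokens, 1107.044 tokens-per-sec
Generation: 10 tokens, 130.926 tokens-per-sec
Peak memory: 78.649 GB
Output generated in 1.78s
Memory used: 3.77 GB
Running mlx-community/Llama-3.3-70B-Instruct-8bit 
ERROR:root:Model type llama not supported.
Failed to load model at mlx-community/Llama-3.3-70B-Instruct-8bit: Model type llama not supported.
"""

if __name__ == "__main__":
    for r in parse_log(SAMPLE):
        status = "FAILED: " + r.error if r.error is not None else "ok"
        print(r.model, r.gen_tokens, r.gen_tps, r.seconds, r.memory_gb, status)
```

The `Prompt:` line that echoes the chat template is skipped on purpose: only the statistics lines are matched, so the model's own text cannot be mistaken for a measurement unless it happens to start with one of these exact prefixes.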

@Blaizzy
Owner

Blaizzy commented Dec 31, 2024

Thanks!

I will take a look at pixtral and deepseek

That shouldn't happen.

@jrp2014
Author

jrp2014 commented Dec 31, 2024

And llava-v1.6-34b-8bit seems to need some attention in future. Dolphin seems to need some extra parameters. Is Florence just not converted?

@Blaizzy
Owner

Blaizzy commented Dec 31, 2024

Sure

Florence, no.

But Florence-2, yes, it is converted 👌🏽

@Blaizzy
Owner

Blaizzy commented Dec 31, 2024

@jrp2014 just tested it and it works.

For pixtral I would recommend using the one on the mlx-community hub:

https://huggingface.co/mlx-community?search_models=pixtral

However, I found a bug with language-only responses that will be fixed in the next release.
[Image: Screenshot 2024-12-31 at 5.50.49 PM]

@Blaizzy
Owner

Blaizzy commented Dec 31, 2024

> llava-v1.6-34b-8bit seems to need some attention in future.
> Dolphin seems to need some extra parameters.

Could you elaborate? I just tested and it is working fine.

@Blaizzy
Owner

Blaizzy commented Dec 31, 2024

The Pixtral and DeepSeek fix is in #165 and will be available as soon as the tests pass.

@jrp2014
Author

jrp2014 commented Dec 31, 2024

I'm just going by the transcript above.

@Blaizzy
Owner

Blaizzy commented Dec 31, 2024

> I'm just going by the transcript above.

What do you mean?

@jrp2014
Author

jrp2014 commented Dec 31, 2024

Sorry, I must have ... errr ... hallucinated some of the reported issues / warnings.

PS: are the READMEs / examples up to date with the latest changes?

Running HuggingFaceTB/SmolVLM-Instruct 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 78398.21it/s]
Some kwargs in processor config are unused and will not have any effect: image_seq_len. 

:

Running OpenGVLab/InternVL2_5-8B 
Fetching 21 files: 100%|█████████████████████████████████████████████████████████████| 21/21 [00:00<00:00, 19499.75it/s]
ERROR:root:Model type internvl_chat not supported.
Failed to load model at OpenGVLab/InternVL2_5-8B: Model type internvl_chat not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running cognitivecomputations/dolphin-2.9.2-qwen2-72b 
Fetching 40 files: 100%|██████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 8632.03it/s]
ERROR:root:Model type qwen2 not supported.
Failed to load model at cognitivecomputations/dolphin-2.9.2-qwen2-72b: Model type qwen2 not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running distilbert/distilbert-base-uncased-finetuned-sst-2-english 
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 19301.91it/s]
ERROR:root:Model type distilbert not supported.
Failed to load model at distilbert/distilbert-base-uncased-finetuned-sst-2-english: Model type distilbert not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running google/siglip-so400m-patch14-384 
Fetching 6 files: 100%|█████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 6643.56it/s]
ERROR:root:Model type siglip not supported.
Failed to load model at google/siglip-so400m-patch14-384: Model type siglip not supported.

 :

Running mlx-community/Llama-3.3-70B-Instruct-8bit 
Fetching 20 files: 100%|█████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 17407.36it/s]
ERROR:root:Model type llama not supported.
Failed to load model at mlx-community/Llama-3.3-70B-Instruct-8bit: Model type llama not supported.
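The failure messages in these transcripts fall into a few recurring kinds, so they can be bucketed before deciding which ones are worth reporting. This is a rough sketch using only message texts that appear in this thread; `PATTERNS`, `classify_failure` and the bucket labels are hypothetical names chosen for illustration, not anything mlx-vlm defines.

```python
import re
from collections import Counter
from typing import Tuple

# (label, pattern) pairs, checked in order. The patterns are written against
# the exact messages pasted in this thread and may need adjusting elsewhere.
PATTERNS = [
    ("unsupported model type", re.compile(r"Model type (\S+) not supported")),
    ("no safetensors", re.compile(r"No safetensors found")),
    (
        "weight shape mismatch",
        re.compile(
            r"Expected shape \(.*?\) but received shape \(.*?\) for parameter (\S+)"
        ),
    ),
    (
        "config missing fields",
        re.compile(r"__init__\(\) missing \d+ required positional arguments?"),
    ),
    ("missing attribute", re.compile(r"object has no attribute '(\w+)'")),
]


def classify_failure(message: str) -> Tuple[str, str]:
    """Return (bucket label, detail) for one failure message."""
    for label, pattern in PATTERNS:
        m = pattern.search(message)
        if m:
            # Patterns without a capture group carry no extra detail.
            detail = m.group(1) if m.groups() else ""
            return label, detail
    return "other", ""


if __name__ == "__main__":
    messages = [
        "Model type llama not supported.",
        "Model type siglip not supported.",
        "Expected shape (1024, 37888) but received shape (1024, 588) for "
        "parameter vision_tower.image_vit.patch_embedding.weight",
        "'DeepseekVLV2Processor' object has no attribute 'eos_token_id'",
    ]
    counts = Counter(classify_failure(msg)[0] for msg in messages)
    for label, n in counts.most_common():
        print(f"{n:2d}  {label}")
```

Keeping the detail (the model type, parameter name or attribute) next to the bucket makes it easier to tell a checkpoint that was never meant for this loader from one that loads but breaks on a specific weight or attribute.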

 : 

Running mlx-community/Molmo-7B-D-0924-bf16 
Fetching 18 files: 100%|█████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 17806.01it/s]
Failed to load model at mlx-community/Molmo-7B-D-0924-bf16: Expected shape (1024, 37888) but received shape (1024, 588) for parameter vision_tower.image_vit.patch_embedding.weight
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

: 

Running microsoft/Florence-2-large-ft 
preprocessor_config.json: 100%|████████████████████████████████████████████████████████| 806/806 [00:00<00:00, 9.80MB/s]
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████| 34.0/34.0 [00:00<00:00, 37.2kB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████| 51.0/51.0 [00:00<00:00, 60.2kB/s]
config.json: 100%|█████████████████████████████████████████████████████████████████| 2.44k/2.44k [00:00<00:00, 34.0MB/s]
processing_florence2.py: 100%|█████████████████████████████████████████████████████| 46.4k/46.4k [00:00<00:00, 2.65MB/s]
modeling_florence2.py: 100%|█████████████████████████████████████████████████████████| 127k/127k [00:00<00:00, 1.91MB/s]
vocab.json: 100%|██████████████████████████████████████████████████████████████████| 1.10M/1.10M [00:00<00:00, 4.89MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 4.97MB/s]
Fetching 9 files: 100%|███████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 15.36it/s]
ERROR:root:No safetensors found in /Users/jrp/.cache/huggingface/hub/models--microsoft--Florence-2-large-ft/snapshots/bb44b80c15e943b1bf7cec6e076359cec6e40178
Failed to load model at microsoft/Florence-2-large-ft: 
No safetensors found in /Users/jrp/.cache/huggingface/hub/models--microsoft--Florence-2-large-ft/snapshots/bb44b80c15e943b1bf7cec6e076359cec6e40178
Create safetensors using the following code:
from transformers import AutoModelForCausalLM, AutoProcessor

model_id= "<huggingface_model_id>"
model = AutoModelForCausalLM.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

model.save_pretrained("<local_dir>")
processor.save_pretrained("<local_dir>")

Then use the <local_dir> as the --hf-path in the convert script.
python -m mlx_vlm.convert --hf-path <local_dir> --mlx-path <mlx_dir>

        
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running microsoft/Phi-3.5-mini-instruct 
Fetching 13 files: 100%|██████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 8301.76it/s]
ERROR:root:Model type phi3 not supported.
Failed to load model at microsoft/Phi-3.5-mini-instruct: Model type phi3 not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running microsoft/Phi-3.5-vision-instruct 
Fetching 14 files: 100%|██████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 9180.78it/s]
/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:524: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
Fetching 14 files: 100%|█████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 21976.14it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|user|>
<|image_1|>Describe this image.<|end|>
<|assistant|>

The image shows two cats lying on a pink couch. The cat on the left is a tabby with a mix of dark and light stripes, while the cat on the right is a solid grey. Both cats have their eyes closed and appear to be sleeping. There are two remote controls on the couch, one blue and one white. The couch has a red cushion on top.<|end|>
==========
Prompt: 771 tokens, 892.031 tokens-per-sec
Generation: 88 tokens, 10.103 tokens-per-sec
Peak memory: 31.532 GB
The image shows two cats lying on a pink couch. The cat on the left is a tabby with a mix of dark and light stripes, while the cat on the right is a solid grey. Both cats have their eyes closed and appear to be sleeping. There are two remote controls on the couch, one blue and one white. The couch has a red cushion on top.<|end|>
Output generated in 10.19s
Memory used: 7.47 GB

Author

jrp2014 commented Dec 31, 2024

A new run with mlx-vlm 0.1.9 and the latest mlx (which seems to break a couple of models).
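To see at a glance which models broke between runs, the failure lines in a log like the one below can be collected with a few lines of Python. This is only a rough sketch: it assumes the two message shapes that appear in these logs ("Failed to load model at ..." and "Failed to generate output for model at ..."), and `collect_failures` is a hypothetical helper name, not part of mlx-vlm.

```python
import re

# Assumed message shapes, copied from the logs in this thread:
#   Failed to load model at <model>: <reason>
#   Failed to generate output for model at <model>: <reason>
FAILURE_RE = re.compile(
    r"^Failed to (?P<stage>load model|generate output for model) at "
    r"(?P<model>\S+): ?(?P<reason>.*)$"
)


def collect_failures(log_text):
    """Map each failing model to (stage, first line of the reason)."""
    failures = {}
    for raw in log_text.splitlines():
        match = FAILURE_RE.match(raw.strip())
        if match is None:
            continue
        stage = "load" if match.group("stage") == "load model" else "generate"
        # The reason may continue on later lines; only the first line is kept,
        # so it can be empty when the message starts on the next line.
        failures[match.group("model")] = (stage, match.group("reason").strip())
    return failures
```

Comparing the dictionaries from two runs (for example, the keys present in the new run but not in the old one) gives the list of models that newly fail.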

The main thing is how fast this package has become!
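For anyone wanting to compare speeds across runs, here is a minimal sketch that pulls the token counts and tokens-per-sec figures out of a log. It assumes the "Running <model>", "Prompt: N tokens, X tokens-per-sec" and "Generation: N tokens, X tokens-per-sec" line shapes seen in these logs; `parse_throughput` is a hypothetical helper, not an mlx-vlm function.

```python
import re

# Line shapes assumed from the logs in this thread.
RUN_RE = re.compile(r"^Running (?P<model>\S+)")
STAT_RE = re.compile(
    r"^(?P<kind>Prompt|Generation): (?P<tokens>\d+) tokens, "
    r"(?P<tps>[\d.]+) tokens-per-sec"
)


def parse_throughput(log_text):
    """Return {model: {"prompt": (tokens, tps), "generation": (tokens, tps)}}."""
    stats = {}
    current_model = None
    for raw in log_text.splitlines():
        line = raw.strip()
        run = RUN_RE.match(line)
        if run:
            current_model = run.group("model")
            continue
        stat = STAT_RE.match(line)
        # Lines such as "Prompt: <|user|>" carry no numbers and are skipped.
        if stat and current_model is not None:
            entry = stats.setdefault(current_model, {})
            entry[stat.group("kind").lower()] = (
                int(stat.group("tokens")),
                float(stat.group("tps")),
            )
    return stats
```

Models that fail to load simply do not appear in the result, since they never print the statistics lines.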

mlx version: 0.21.1.dev20241231+8ecdfb718
mlx-vlm version: 0.1.9
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running HuggingFaceTB/SmolVLM-Instruct 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 13725.57it/s]
Some kwargs in processor config are unused and will not have any effect: image_seq_len. 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 19996.68it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>User:<image>Describe this image.<end_of_utterance>
Assistant:
 Two cats are sleeping on a pink blanket.
==========
Prompt: 1195 tokens, 1100.219 tokens-per-sec
Generation: 10 tokens, 130.233 tokens-per-sec
Peak memory: 6.007 GB
 Two cats are sleeping on a pink blanket.
Output generated in 1.87s
Memory used: 4.43 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running OpenGVLab/InternVL2_5-8B 
Fetching 21 files: 100%|██████████████████████████████████████████████████████████████| 21/21 [00:00<00:00, 9835.89it/s]
ERROR:root:Model type internvl_chat not supported.
Failed to load model at OpenGVLab/InternVL2_5-8B: Model type internvl_chat not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running cognitivecomputations/dolphin-2.9.2-qwen2-72b 
Fetching 40 files: 100%|██████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 6299.17it/s]
ERROR:root:Model type qwen2 not supported.
Failed to load model at cognitivecomputations/dolphin-2.9.2-qwen2-72b: Model type qwen2 not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running distilbert/distilbert-base-uncased-finetuned-sst-2-english 
Fetching 10 files: 100%|██████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 9700.06it/s]
ERROR:root:Model type distilbert not supported.
Failed to load model at distilbert/distilbert-base-uncased-finetuned-sst-2-english: Model type distilbert not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running google/siglip-so400m-patch14-384 
Fetching 6 files: 100%|████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 13691.96it/s]
ERROR:root:Model type siglip not supported.
Failed to load model at google/siglip-so400m-patch14-384: Model type siglip not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running meta-llama/Llama-3.2-11B-Vision-Instruct 
Fetching 15 files: 100%|█████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 13977.91it/s]
Fetching 15 files: 100%|█████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 38956.38it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Describe this image.<|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>


The image shows two cats lying on a pink blanket, with two remote controls placed on the couch behind them. The cat on the left is a small, fluffy tabby cat with a long tail and a green collar. The cat on the right is a larger, striped tabby cat with a short tail. Both cats are lying on their backs, with their paws stretched out to the sides. The remote controls are placed on the couch behind the cats, with the one on the left being a standard TV remote and the one on the right being a smaller, more compact remote. The background of the image is a pink blanket that covers the couch, which is visible in the top-left corner of the image. Overall, the image appears to be a playful and cozy scene, with the two cats enjoying a relaxing moment together on the couch.
==========
Prompt: 16 tokens, 3.599 tokens-per-sec
Generation: 170 tokens, 3.725 tokens-per-sec
Peak memory: 31.499 GB
The image shows two cats lying on a pink blanket, with two remote controls placed on the couch behind them. The cat on the left is a small, fluffy tabby cat with a long tail and a green collar. The cat on the right is a larger, striped tabby cat with a short tail. Both cats are lying on their backs, with their paws stretched out to the sides. The remote controls are placed on the couch behind the cats, with the one on the left being a standard TV remote and the one on the right being a smaller, more compact remote. The background of the image is a pink blanket that covers the couch, which is visible in the top-left corner of the image. Overall, the image appears to be a playful and cozy scene, with the two cats enjoying a relaxing moment together on the couch.
Output generated in 50.78s
Memory used: 17.35 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running microsoft/Florence-2-large-ft 
Fetching 9 files: 100%|████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 90742.15it/s]
ERROR:root:No safetensors found in /Users/jrp/.cache/huggingface/hub/models--microsoft--Florence-2-large-ft/snapshots/bb44b80c15e943b1bf7cec6e076359cec6e40178
Failed to load model at microsoft/Florence-2-large-ft: 
No safetensors found in /Users/jrp/.cache/huggingface/hub/models--microsoft--Florence-2-large-ft/snapshots/bb44b80c15e943b1bf7cec6e076359cec6e40178
Create safetensors using the following code:

from transformers import AutoModelForCausalLM, AutoProcessor

model_id= "<huggingface_model_id>"
model = AutoModelForCausalLM.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

model.save_pretrained("<local_dir>")
processor.save_pretrained("<local_dir>")

Then use the <local_dir> as the --hf-path in the convert script.

python -m mlx_vlm.convert --hf-path <local_dir> --mlx-path <mlx_dir>

        
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running microsoft/Phi-3.5-mini-instruct 
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 35544.95it/s]
ERROR:root:Model type phi3 not supported.
Failed to load model at microsoft/Phi-3.5-mini-instruct: Model type phi3 not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running microsoft/Phi-3.5-vision-instruct 
Fetching 14 files: 100%|█████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 52805.99it/s]
/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:524: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
Fetching 14 files: 100%|█████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 40948.57it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|user|>
<|image_1|>Describe this image.<|end|>
<|assistant|>

The image shows two cats lying on a pink couch. The cat on the left is a tabby with a mix of dark and light stripes, while the cat on the right is a solid grey. Both cats have their eyes closed and appear to be sleeping. There are two remote controls on the couch, one blue and one white. The couch has a red cushion on top.<|end|>
==========
Prompt: 771 tokens, 868.850 tokens-per-sec
Generation: 88 tokens, 10.306 tokens-per-sec
Peak memory: 31.499 GB
The image shows two cats lying on a pink couch. The cat on the left is a tabby with a mix of dark and light stripes, while the cat on the right is a solid grey. Both cats have their eyes closed and appear to be sleeping. There are two remote controls on the couch, one blue and one white. The couch has a red cushion on top.<|end|>
Output generated in 9.99s
Memory used: 7.56 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mistral-community/pixtral-12b 
Fetching 15 files: 100%|█████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 18520.62it/s]
Failed to load model at mistral-community/pixtral-12b: Unsupported model type: pixtral
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Florence-2-large-ft-bf16 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 18900.36it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 34615.99it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: Describe this image.
<s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s>
==========
Prompt: 7 tokens, 24.636 tokens-per-sec
Generation: 256 tokens, 172.000 tokens-per-sec
Peak memory: 31.499 GB
<s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s>
Output generated in 2.37s
Memory used: 1.58 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Llama-3.2-11B-Vision-Instruct-8bit 
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 19047.70it/s]
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 12468.20it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Describe this image.<|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>


The image shows two cats lying on a pink blanket, with two remote controls placed on the couch behind them. The cat on the left is a small, fluffy tabby with a long tail and a green collar. The cat on the right is a larger, striped tabby with a shorter tail. Both cats are lying on their sides, facing each other, and appear to be sleeping or resting. The remote controls are placed on the couch behind the cats, suggesting that they may have been watching TV or playing with the remotes before falling asleep. The overall atmosphere of the image is one of relaxation and contentment, as the cats seem to be enjoying a peaceful moment together.
==========
Prompt: 15 tokens, 4.640 tokens-per-sec
Generation: 137 tokens, 8.683 tokens-per-sec
Peak memory: 31.499 GB
The image shows two cats lying on a pink blanket, with two remote controls placed on the couch behind them. The cat on the left is a small, fluffy tabby with a long tail and a green collar. The cat on the right is a larger, striped tabby with a shorter tail. Both cats are lying on their sides, facing each other, and appear to be sleeping or resting. The remote controls are placed on the couch behind the cats, suggesting that they may have been watching TV or playing with the remotes before falling asleep. The overall atmosphere of the image is one of relaxation and contentment, as the cats seem to be enjoying a peaceful moment together.
Output generated in 19.67s
Memory used: 10.63 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Llama-3.3-70B-Instruct-8bit 
Fetching 20 files: 100%|█████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 19440.57it/s]
ERROR:root:Model type llama not supported.
Failed to load model at mlx-community/Llama-3.3-70B-Instruct-8bit: Model type llama not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Molmo-7B-D-0924-8bit 
Fetching 16 files: 100%|██████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 8411.74it/s]
Fetching 16 files: 100%|█████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 11860.88it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: Describe this image.
 In this image, two cats are peacefully sleeping on a red couch, which is covered with a pink blanket. The cat on the left is a gray and black striped feline with a white belly and a black tail, wearing a green collar. This cat is lying on its side with its front paws stretched out and its back legs hanging off the edge of the couch. The cat on the right is a larger, chunkier cat with a mix of brown, black, and white fur, also with a white belly. This cat is lying on its side with its back legs hanging off the couch and its front legs resting on the pink blanket. Both cats are nestled close to each other, with a white remote control positioned between them. The scene is cozy and serene, capturing a moment of tranquility as the two cats rest comfortably on the couch.
==========
Prompt: 749 tokens, 80.020 tokens-per-sec
Generation: 172 tokens, 40.705 tokens-per-sec
Peak memory: 31.499 GB
 In this image, two cats are peacefully sleeping on a red couch, which is covered with a pink blanket. The cat on the left is a gray and black striped feline with a white belly and a black tail, wearing a green collar. This cat is lying on its side with its front paws stretched out and its back legs hanging off the edge of the couch. The cat on the right is a larger, chunkier cat with a mix of brown, black, and white fur, also with a white belly. This cat is lying on its side with its back legs hanging off the couch and its front legs resting on the pink blanket. Both cats are nestled close to each other, with a white remote control positioned between them. The scene is cozy and serene, capturing a moment of tranquility as the two cats rest comfortably on the couch.
Output generated in 14.21s
Memory used: 8.27 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Molmo-7B-D-0924-bf16 
Fetching 18 files: 100%|█████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 13554.30it/s]
Failed to load model at mlx-community/Molmo-7B-D-0924-bf16: Expected shape (1024, 37888) but received shape (1024, 588) for parameter vision_tower.image_vit.patch_embedding.weight
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Phi-3.5-vision-instruct-bf16 
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 13025.79it/s]
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 14586.93it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|user|>
<|image_1|>Describe this image.<|end|>
<|assistant|>

The image shows two cats lying on a pink couch. The cat on the left is a tabby with a mix of dark and light stripes, while the cat on the right is a solid grey. Both cats have their eyes closed and appear to be sleeping. There are two remote controls on the couch, one blue and one white. The couch has a red cushion on top.<|end|>
==========
Prompt: 771 tokens, 891.027 tokens-per-sec
Generation: 88 tokens, 10.272 tokens-per-sec
Peak memory: 31.499 GB
The image shows two cats lying on a pink couch. The cat on the left is a tabby with a mix of dark and light stripes, while the cat on the right is a solid grey. Both cats have their eyes closed and appear to be sleeping. There are two remote controls on the couch, one blue and one white. The couch has a red cushion on top.<|end|>
Output generated in 10.05s
Memory used: 7.81 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/QVQ-72B-Preview-8bit 
Fetching 25 files: 100%|█████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 13068.00it/s]
Fetching 25 files: 100%|█████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 19410.88it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>system
You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.<|im_end|>
<|im_start|>user
Describe this image.<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant

Failed to generate output for model at mlx-community/QVQ-72B-Preview-8bit: arange(): incompatible function arguments. The following argument types are supported:
    1. arange(start : Union[int, float], stop : Union[int, float], step : Union[None, int, float], dtype: Optional[Dtype] = None, *, stream: Union[None, Stream, Device] = None) -> array
    2. arange(stop : Union[int, float], step : Union[None, int, float] = None, dtype: Optional[Dtype] = None, *, stream: Union[None, Stream, Device] = None) -> array

Invoked with types: mlx.core.array, kwargs = { dtype: mlx.core.Dtype }
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Qwen2-VL-7B-Instruct-8bit 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 14458.96it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 67378.38it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Describe this image.<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant

Failed to generate output for model at mlx-community/Qwen2-VL-7B-Instruct-8bit: arange(): incompatible function arguments. The following argument types are supported:
    1. arange(start : Union[int, float], stop : Union[int, float], step : Union[None, int, float], dtype: Optional[Dtype] = None, *, stream: Union[None, Stream, Device] = None) -> array
    2. arange(stop : Union[int, float], step : Union[None, int, float] = None, dtype: Optional[Dtype] = None, *, stream: Union[None, Stream, Device] = None) -> array

Invoked with types: mlx.core.array, kwargs = { dtype: mlx.core.Dtype }
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/SmolVLM-Instruct-bf16 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 40820.48it/s]
Some kwargs in processor config are unused and will not have any effect: image_seq_len. 
Fetching 12 files: 100%|████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 266305.02it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>User:<image>Describe this image.<end_of_utterance>
Assistant:
 Two cats are sleeping on a pink blanket.
==========
Prompt: 1195 tokens, 1084.444 tokens-per-sec
Generation: 10 tokens, 58.950 tokens-per-sec
Peak memory: 86.818 GB
 Two cats are sleeping on a pink blanket.
Output generated in 1.96s
Memory used: 4.30 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/deepseek-vl2-8bit 
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 12820.59it/s]
Some kwargs in processor config are unused and will not have any effect: pad_token, image_token, image_mean, add_special_token, mask_prompt, patch_size, sft_format, image_std, downsample_ratio, candidate_resolutions, normalize, ignore_id. 
Add pad token = ['<|▁pad▁|>'] to the tokenizer
<|▁pad▁|>:2
Add image token = ['<image>'] to the tokenizer
<image>:128815
Added grounding-related tokens
Added chat tokens
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 19907.25it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|User|>: <image>
Describe this image.

<|Assistant|>:
Two tabby cats lying on what appears to be a red couch or cushioned surface covered by a pink blanket that has fringed edges. The cat closest to the top of the frame is lying on its side facing leftward; it appears relaxed but alert as if observing something out of view. Its body language suggests relaxation but also attentiveness. Next to this first cat lies another tabby cat facing rightward towards the camera's perspective. This second cat’s head rests low near the ground while all four paws are stretched out comfortably forward. A remote control rests between them both towards the upper left corner above where they rest. The background consists mainly of more red fabric which could suggest additional seating like an L-shaped sofa set against a wall painted white. There are no other discernible items present within the immediate vicinity visible from the given angle. This setting gives off a cozy domestic vibe typical for pet owners who enjoy having their pets close at hand while relaxing at home.
==========
Prompt: 1034 tokens, 321.844 tokens-per-sec
Generation: 196 tokens, 61.231 tokens-per-sec
Peak memory: 86.818 GB
Two tabby cats lying on what appears to be a red couch or cushioned surface covered by a pink blanket that has fringed edges. The cat closest to the top of the frame is lying on its side facing leftward; it appears relaxed but alert as if observing something out of view. Its body language suggests relaxation but also attentiveness. Next to this first cat lies another tabby cat facing rightward towards the camera's perspective. This second cat’s head rests low near the ground while all four paws are stretched out comfortably forward. A remote control rests between them both towards the upper left corner above where they rest. The background consists mainly of more red fabric which could suggest additional seating like an L-shaped sofa set against a wall painted white. There are no other discernible items present within the immediate vicinity visible from the given angle. This setting gives off a cozy domestic vibe typical for pet owners who enjoy having their pets close at hand while relaxing at home.
Output generated in 7.01s
Memory used: 27.38 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/dolphin-vision-72b-4bit 
Fetching 19 files: 100%|█████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 23689.59it/s]
Failed to load model at mlx-community/dolphin-vision-72b-4bit: TextConfig.__init__() missing 7 required positional arguments: 'model_type', 'hidden_size', 'num_hidden_layers', 'intermediate_size', 'num_attention_heads', 'rms_norm_eps', and 'vocab_size'
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/idefics2-8b-chatty-8bit 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 15096.48it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 16241.25it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: User: Describe this image.<image><end_of_utterance>
Assistant:
In the tranquil setting of a pink couch, two feline companions are captured in a moment of serene slumber. The cat on the left, a striking tabby with a coat of brown and black stripes, is curled up in a peaceful slumber. Its head is gently resting on the arm of the couch, a picture of contentment.

On the right, a gray and white cat is also enjoying the comfort of the couch. Its body is stretched out in a relaxed pose, with its head resting on the arm of the couch as well. The two cats, despite their different fur colors, share a common bond in their choice of resting spot and their peaceful demeanor.

The image is a beautiful snapshot of these two cats, their colors contrasting yet complementing each other, as they enjoy a shared moment of rest on the pink couch.<end_of_utterance>
==========
Prompt: 79 tokens, 152.960 tokens-per-sec
Generation: 182 tokens, 48.485 tokens-per-sec
Peak memory: 86.818 GB
In the tranquil setting of a pink couch, two feline companions are captured in a moment of serene slumber. The cat on the left, a striking tabby with a coat of brown and black stripes, is curled up in a peaceful slumber. Its head is gently resting on the arm of the couch, a picture of contentment.

On the right, a gray and white cat is also enjoying the comfort of the couch. Its body is stretched out in a relaxed pose, with its head resting on the arm of the couch as well. The two cats, despite their different fur colors, share a common bond in their choice of resting spot and their peaceful demeanor.

The image is a beautiful snapshot of these two cats, their colors contrasting yet complementing each other, as they enjoy a shared moment of rest on the pink couch.<end_of_utterance>
Output generated in 4.91s
Memory used: 8.29 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/llava-v1.6-34b-8bit 
Fetching 17 files: 100%|█████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 22075.28it/s]
Fetching 17 files: 100%|█████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 18828.40it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>user
<image>
Describe this image.<|im_end|>
<|im_start|>assistant

Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.50.
The image shows two cats lying on a pink surface, which appears to be a blanket or a piece of furniture. The cat on the left is a tabby with a mix of dark and light stripes, and it is lying on its side with its head resting on its front paws. The cat on the right is also a tabby, with a similar pattern of stripes, and it is lying on its stomach with its head turned to the side. Between the two cats, there is a remote control with a white and gray color scheme. The background is not clearly visible, but it seems to be an indoor setting with a red surface, possibly a couch or a chair. The image has a casual, candid quality, capturing a moment of rest for the cats.
==========
Prompt: 15 tokens, 3.744 tokens-per-sec
Generation: 155 tokens, 10.019 tokens-per-sec
Peak memory: 86.818 GB
The image shows two cats lying on a pink surface, which appears to be a blanket or a piece of furniture. The cat on the left is a tabby with a mix of dark and light stripes, and it is lying on its side with its head resting on its front paws. The cat on the right is also a tabby, with a similar pattern of stripes, and it is lying on its stomach with its head turned to the side. Between the two cats, there is a remote control with a white and gray color scheme. The background is not clearly visible, but it seems to be an indoor setting with a red surface, possibly a couch or a chair. The image has a casual, candid quality, capturing a moment of rest for the cats.
Output generated in 20.07s
Memory used: 34.25 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/llava-v1.6-mistral-7b-8bit 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 48582.67it/s]
Fetching 12 files: 100%|████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 142987.64it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: [INST] <image>
Describe this image. [/INST]
In the image, there are two cats lounging on a pink blanket that is spread out on a red couch. The cat on the left is a gray and white tabby, while the cat on the right is a brown and black tabby. Both cats are lying on their sides, with their heads resting on the arm of the couch. The tabby cat on the right is facing towards the left side of the image, while the tabby cat on the left is facing towards the right side of the image. In the background, there is a remote control resting on the arm of the couch. The cats appear to be relaxed and comfortable in their environment. 
==========
Prompt: 16 tokens, 15.223 tokens-per-sec
Generation: 136 tokens, 47.645 tokens-per-sec
Peak memory: 86.818 GB
In the image, there are two cats lounging on a pink blanket that is spread out on a red couch. The cat on the left is a gray and white tabby, while the cat on the right is a brown and black tabby. Both cats are lying on their sides, with their heads resting on the arm of the couch. The tabby cat on the right is facing towards the left side of the image, while the tabby cat on the left is facing towards the right side of the image. In the background, there is a remote control resting on the arm of the couch. The cats appear to be relaxed and comfortable in their environment. 
Output generated in 4.56s
Memory used: 6.56 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-10b-ft-docci-448-6bit 
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 16031.74it/s]
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 20712.61it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <image>Describe this image.
A top-down view of two cats lying on a pink blanket. The cat on the left is lying on its side, and its head is facing to the right. Its body is facing to the left. Its tail is sticking out to the left. The cat on the right is lying on its side, and its head is facing down. Its body is facing to the right. Its tail is sticking out to the right. Two remote controls are on the pink blanket, one on each side of the cats. The one on the left is gray, and the one on the right is white.
==========
Prompt: 1030 tokens, 487.085 tokens-per-sec
Generation: 121 tokens, 38.952 tokens-per-sec
Peak memory: 86.818 GB
A top-down view of two cats lying on a pink blanket. The cat on the left is lying on its side, and its head is facing to the right. Its body is facing to the left. Its tail is sticking out to the left. The cat on the right is lying on its side, and its head is facing down. Its body is facing to the right. Its tail is sticking out to the right. Two remote controls are on the pink blanket, one on each side of the cats. The one on the left is gray, and the one on the right is white.
Output generated in 5.87s
Memory used: 7.77 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-10b-ft-docci-448-bf16 
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 45491.37it/s]
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 30109.86it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <image>Describe this image.
A top-down view of two cats laying on a pink blanket. The cat on the left is a gray tabby cat with black stripes and a black tail. It is laying on its side with its head on the pink blanket and its body facing the left side of the image. Its front legs are stretched out in front of it, and its back legs are curled up. The cat on the right is a brown tabby cat with black stripes. It is laying on its side with its head on the pink blanket and its body facing the right side of the image. Its front legs are stretched out in front of it, and its back legs are curled up. There is a gray remote control on the left side of the image and a gray remote control on the right side of the image.
==========
Prompt: 1030 tokens, 463.413 tokens-per-sec
Generation: 159 tokens, 4.624 tokens-per-sec
Peak memory: 86.818 GB
A top-down view of two cats laying on a pink blanket. The cat on the left is a gray tabby cat with black stripes and a black tail. It is laying on its side with its head on the pink blanket and its body facing the left side of the image. Its front legs are stretched out in front of it, and its back legs are curled up. The cat on the right is a brown tabby cat with black stripes. It is laying on its side with its head on the pink blanket and its body facing the right side of the image. Its front legs are stretched out in front of it, and its back legs are curled up. There is a gray remote control on the left side of the image and a gray remote control on the right side of the image.
Output generated in 37.25s
Memory used: 18.09 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-3b-ft-docci-448-bf16 
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 16844.59it/s]
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 37744.02it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <image>Describe this image.
A top-down view of two cats sleeping on a pink blanket. The cat on the left is a gray and black tabby cat, and it is lying on its side with its head facing the right. Its tail is sticking out to the left. Its front paws are hanging off the blanket on the left. A white remote is between the cats. The cat on the right is lying on its side, and its head is facing the left. Its tail is sticking out to the left. A white remote is between the cats' bodies.
==========
Prompt: 1030 tokens, 1315.313 tokens-per-sec
Generation: 109 tokens, 16.431 tokens-per-sec
Peak memory: 86.818 GB
A top-down view of two cats sleeping on a pink blanket. The cat on the left is a gray and black tabby cat, and it is lying on its side with its head facing the right. Its tail is sticking out to the left. Its front paws are hanging off the blanket on the left. A white remote is between the cats. The cat on the right is lying on its side, and its head is facing the left. Its tail is sticking out to the left. A white remote is between the cats' bodies.
Output generated in 8.04s
Memory used: 5.49 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-3b-pt-896-4bit 
Fetching 7 files: 100%|███████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 155344.59it/s]
Fetching 7 files: 100%|████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 26837.41it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <image>Describe this image.
Cat.
==========
Prompt: 4102 tokens, 1259.661 tokens-per-sec
Generation: 3 tokens, 71.009 tokens-per-sec
Peak memory: 86.818 GB
Cat.
Output generated in 3.86s
Memory used: 1.66 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/pixtral-12b-8bit 
Fetching 11 files: 100%|█████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 35710.02it/s]
Fetching 11 files: 100%|█████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 23068.67it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <s>[INST][IMG]Describe this image.[/INST]
In the image, there are two cats resting on a pink sofa. One cat is positioned towards the left side, stretching out with its body relaxed and one paw extended upwards. The other cat is on the right side, lying down with its head resting on the sofa. There are two remote controls in the scene. One remote control is placed towards the top left corner, while the other is situated more towards the center-right of the image.
==========
Prompt: 1238 tokens, 424.716 tokens-per-sec
Generation: 89 tokens, 28.493 tokens-per-sec
Peak memory: 86.818 GB
In the image, there are two cats resting on a pink sofa. One cat is positioned towards the left side, stretching out with its body relaxed and one paw extended upwards. The other cat is on the right side, lying down with its head resting on the sofa. There are two remote controls in the scene. One remote control is placed towards the top left corner, while the other is situated more towards the center-right of the image.
Output generated in 6.63s
Memory used: 12.58 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
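The per-model blocks above all follow the same shape, so they could be reduced to one row per model for easier comparison. The following is only a rough sketch: `RunSummary` and `summarise` are hypothetical names, not part of mlx-vlm, and the patterns are taken from the log lines shown here (`Running ...`, `Generation: ...`, `Output generated in ...`, `Memory used: ...`). It reads `Memory used` rather than `Peak memory`, because the peak figure seems not to reset between models in this log. A caption that itself begins with one of these prefixes would confuse it.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunSummary:
    """One row per 'Running <model>' block; fields stay None if the model failed."""
    model: str
    gen_tokens: Optional[int] = None
    gen_tps: Optional[float] = None
    seconds: Optional[float] = None
    memory_gb: Optional[float] = None


# Patterns copied from the log format in this thread (an assumption about its stability).
RUNNING = re.compile(r"^Running (\S+)")
GENERATION = re.compile(r"^Generation: (\d+) tokens, ([\d.]+) tokens-per-sec")
ELAPSED = re.compile(r"^Output generated in ([\d.]+)s")
MEMORY = re.compile(r"^Memory used: ([\d.]+) GB")


def summarise(log_text: str) -> List[RunSummary]:
    """Collect generation speed, wall time and memory for each model in the log."""
    runs: List[RunSummary] = []
    current: Optional[RunSummary] = None
    for raw in log_text.splitlines():
        line = raw.strip()
        m = RUNNING.match(line)
        if m:
            current = RunSummary(model=m.group(1))
            runs.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first 'Running' line
        m = GENERATION.match(line)
        if m:
            current.gen_tokens = int(m.group(1))
            current.gen_tps = float(m.group(2))
            continue
        m = ELAPSED.match(line)
        if m:
            current.seconds = float(m.group(1))
            continue
        m = MEMORY.match(line)
        if m:
            current.memory_gb = float(m.group(1))
    return runs
```

Models that fail to load still get a row, with the numeric fields left as `None`, which makes the unsupported ones easy to spot.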

I found the warning that the Llava model issues:

Fetching 17 files: 100%|█████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 22075.28it/s]
Fetching 17 files: 100%|█████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 18828.40it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>user
<image>
Describe this image.<|im_end|>
<|im_start|>assistant

Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.50.
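The warning itself suggests setting the two attributes directly on the processor. A minimal sketch of that idea follows; `ensure_llava_next_attrs` is a hypothetical helper, and the values `14` and `"default"` are placeholders only. The right values should come from the model's own config, and I have not checked how mlx-vlm exposes the processor for this model. The demo uses a stand-in object so that it runs without downloading anything.

```python
from types import SimpleNamespace


def ensure_llava_next_attrs(processor, patch_size, vision_feature_select_strategy):
    """Set the two attributes named in the warning, but only where they are missing or None."""
    if getattr(processor, "patch_size", None) is None:
        processor.patch_size = patch_size
    if getattr(processor, "vision_feature_select_strategy", None) is None:
        processor.vision_feature_select_strategy = vision_feature_select_strategy
    return processor


# Stand-in for a real processor object (assumption: the real one accepts attribute assignment,
# as the warning text implies).
demo = SimpleNamespace(patch_size=None)
ensure_llava_next_attrs(demo, patch_size=14, vision_feature_select_strategy="default")
```

Existing values are left alone, so calling this on a processor that already carries the attributes changes nothing.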

@Blaizzy
Owner

Blaizzy commented Dec 31, 2024

Thank you very much!

Your evals do help me a lot.

Please run Florence-2 from the mlx-community repo.

MLX-VLM only supports safetensors.
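Since only safetensors weights are supported, a quick pre-flight check on a locally downloaded model directory can save a failed load. This is just a sketch using the standard library; `has_safetensors` is a hypothetical helper, not an mlx-vlm function, and it only looks at the top level of the directory.

```python
from pathlib import Path


def has_safetensors(model_dir) -> bool:
    """Return True if the directory contains at least one *.safetensors file."""
    return any(Path(model_dir).glob("*.safetensors"))
```

A repo that ships only `.bin` or `.pt` weights would return `False` here and would need converting first.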

@Blaizzy
Owner

Blaizzy commented Dec 31, 2024

I will fix all those.

Regarding the warning, I wouldn't worry: it's a transformers warning that I will handle soon.

@jrp2014
Author

jrp2014 commented Dec 31, 2024

Most of the models I have picked can provide some sort of description of the given image, but few go further and provide keywords, and the keywords they do provide are generally of limited quality.
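One rough way to tell mechanically whether a model returned keywords at all is to look for a line made up of several short comma-separated items. The sketch below is a heuristic only: `find_keyword_line` and its thresholds are my own assumptions, and it says nothing about keyword quality.

```python
def find_keyword_line(text, min_items=4, max_words_per_item=3):
    """Return the items of the last line that looks like a keyword list, or None.

    A line qualifies if splitting on commas gives at least `min_items` non-empty
    items and every item has at most `max_words_per_item` words.
    """
    for line in reversed(text.splitlines()):
        items = [item.strip() for item in line.split(",")]
        if len(items) < min_items:
            continue
        if all(item and len(item.split()) <= max_words_per_item for item in items):
            return items
    return None
```

Ordinary prose sentences usually fail the check because their comma-separated clauses are longer than three words, while a bare tag line passes.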

python check_models.py
mlx version: 0.21.1.dev20241231+8ecdfb718
mlx-vlm version: 0.1.9
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running HuggingFaceTB/SmolVLM-Instruct 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 22733.36it/s]
Some kwargs in processor config are unused and will not have any effect: image_seq_len. 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 25140.68it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>User:<image>Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<end_of_utterance>
Assistant:
 Two cats sleeping on a pink blanket with a remote control.
==========
Prompt: 1217 tokens, 1104.561 tokens-per-sec
Generation: 13 tokens, 127.383 tokens-per-sec
Peak memory: 6.007 GB
 Two cats sleeping on a pink blanket with a remote control.
Output generated in 1.94s
Memory used: 4.41 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running OpenGVLab/InternVL2_5-8B 
Fetching 21 files: 100%|█████████████████████████████████████████████████████████████| 21/21 [00:00<00:00, 10453.40it/s]
ERROR:root:Model type internvl_chat not supported.
Failed to load model at OpenGVLab/InternVL2_5-8B: Model type internvl_chat not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running cognitivecomputations/dolphin-2.9.2-qwen2-72b 
Fetching 40 files: 100%|██████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 7803.72it/s]
ERROR:root:Model type qwen2 not supported.
Failed to load model at cognitivecomputations/dolphin-2.9.2-qwen2-72b: Model type qwen2 not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running distilbert/distilbert-base-uncased-finetuned-sst-2-english 
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 13586.99it/s]
ERROR:root:Model type distilbert not supported.
Failed to load model at distilbert/distilbert-base-uncased-finetuned-sst-2-english: Model type distilbert not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running google/siglip-so400m-patch14-384 
Fetching 6 files: 100%|████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 11853.90it/s]
ERROR:root:Model type siglip not supported.
Failed to load model at google/siglip-so400m-patch14-384: Model type siglip not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running meta-llama/Llama-3.2-11B-Vision-Instruct 
Fetching 15 files: 100%|█████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 10344.39it/s]
Fetching 15 files: 100%|██████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 5814.12it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>


The image shows two cats lying on a pink blanket with two remote controls nearby.

The cat on the left is a small, fluffy tabby cat with a long tail and a green collar. The cat on the right is a larger, striped tabby cat with a short tail. Both cats are lying on their backs, with their paws splayed out to the sides. The remote controls are placed on the couch behind the cats, with the one on the left being a standard TV remote and the one on the right being a smaller, more compact remote.

The background of the image is a pink blanket that covers the couch, which is red. The overall atmosphere of the image suggests a cozy and relaxing scene, with the two cats enjoying a peaceful moment together.
==========
Prompt: 36 tokens, 9.343 tokens-per-sec
Generation: 154 tokens, 3.783 tokens-per-sec
Peak memory: 31.499 GB
The image shows two cats lying on a pink blanket with two remote controls nearby.

The cat on the left is a small, fluffy tabby cat with a long tail and a green collar. The cat on the right is a larger, striped tabby cat with a short tail. Both cats are lying on their backs, with their paws splayed out to the sides. The remote controls are placed on the couch behind the cats, with the one on the left being a standard TV remote and the one on the right being a smaller, more compact remote.

The background of the image is a pink blanket that covers the couch, which is red. The overall atmosphere of the image suggests a cozy and relaxing scene, with the two cats enjoying a peaceful moment together.
Output generated in 45.34s
Memory used: 18.03 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running microsoft/Phi-3.5-mini-instruct 
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 30393.51it/s]
ERROR:root:Model type phi3 not supported.
Failed to load model at microsoft/Phi-3.5-mini-instruct: Model type phi3 not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running microsoft/Phi-3.5-vision-instruct 
Fetching 14 files: 100%|█████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 16167.47it/s]
/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:524: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
Fetching 14 files: 100%|█████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 20227.44it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|user|>
<|image_1|>Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<|end|>
<|assistant|>

Two cats, one with a striped coat and the other with a tabby coat, are sleeping on a pink couch. There are two remote controls on the couch, one on the left side and the other on the right side. The couch has a red cushion and a white cushion.<|end|>
==========
Prompt: 795 tokens, 915.214 tokens-per-sec
Generation: 70 tokens, 10.551 tokens-per-sec
Peak memory: 31.499 GB
Two cats, one with a striped coat and the other with a tabby coat, are sleeping on a pink couch. There are two remote controls on the couch, one on the left side and the other on the right side. The couch has a red cushion and a white cushion.<|end|>
Output generated in 8.43s
Memory used: 7.49 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mistral-community/pixtral-12b 
Fetching 15 files: 100%|█████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 23215.70it/s]
Failed to load model at mistral-community/pixtral-12b: Unsupported model type: pixtral
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Florence-2-large-ft-bf16 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 17488.41it/s]
configuration_florence2.py: 100%|██████████████████████████████████████████████████| 15.1k/15.1k [00:00<00:00, 19.9MB/s]
A new version of the following files was downloaded from https://huggingface.co/microsoft/Florence-2-large-ft:
- configuration_florence2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 21769.74it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.
<s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s>
==========
Prompt: 29 tokens, 99.215 tokens-per-sec
Generation: 256 tokens, 171.530 tokens-per-sec
Peak memory: 31.499 GB
<s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s>
Output generated in 2.70s
Memory used: 1.58 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Llama-3.2-11B-Vision-Instruct-8bit 
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 16637.46it/s]
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 15033.35it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>


The image shows two cats lying on a pink blanket, with two remote controls placed on the couch behind them.

* Two cats are lying on a pink blanket:
	+ The cat on the left is smaller and has a fluffy tail.
	+ The cat on the right is larger and has a more mottled coat.
* The cats are lying on a pink blanket:
	+ The blanket appears to be made of a soft, plush material.
	+ It is spread out on a flat surface, possibly a couch or bed.
* There are two remote controls on the couch behind the cats:
	+ The remote controls are placed on the back of the couch, near the headrest.
	+ They appear to be standard TV remote controls, with buttons and a screen.

Overall, the image suggests that the cats are relaxing on the couch, possibly watching TV or taking a nap. The presence of the remote controls implies that the cats may be enjoying some entertainment or leisure time.
==========
Prompt: 35 tokens, 10.729 tokens-per-sec
Generation: 198 tokens, 8.708 tokens-per-sec
Peak memory: 31.499 GB
The image shows two cats lying on a pink blanket, with two remote controls placed on the couch behind them.

* Two cats are lying on a pink blanket:
	+ The cat on the left is smaller and has a fluffy tail.
	+ The cat on the right is larger and has a more mottled coat.
* The cats are lying on a pink blanket:
	+ The blanket appears to be made of a soft, plush material.
	+ It is spread out on a flat surface, possibly a couch or bed.
* There are two remote controls on the couch behind the cats:
	+ The remote controls are placed on the back of the couch, near the headrest.
	+ They appear to be standard TV remote controls, with buttons and a screen.

Overall, the image suggests that the cats are relaxing on the couch, possibly watching TV or taking a nap. The presence of the remote controls implies that the cats may be enjoying some entertainment or leisure time.
Output generated in 27.17s
Memory used: 10.61 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Llama-3.3-70B-Instruct-8bit 
Fetching 20 files: 100%|█████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 19082.37it/s]
ERROR:root:Model type llama not supported.
Failed to load model at mlx-community/Llama-3.3-70B-Instruct-8bit: Model type llama not supported.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Molmo-7B-D-0924-8bit 
Fetching 16 files: 100%|██████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 6364.65it/s]
Fetching 16 files: 100%|█████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 11955.97it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.
 Two cats sleeping on a red couch with a pink blanket. One cat is gray with black stripes, the other is brown with black stripes. Both have white bellies and paws. Two remote controls are visible between the cats. The scene is cozy and peaceful, with the cats resting comfortably on the couch.

Cats, couch, pink blanket, remotes, striped, white, cozy, peaceful
==========
Prompt: 769 tokens, 84.397 tokens-per-sec
Generation: 81 tokens, 40.397 tokens-per-sec
Peak memory: 31.499 GB
 Two cats sleeping on a red couch with a pink blanket. One cat is gray with black stripes, the other is brown with black stripes. Both have white bellies and paws. Two remote controls are visible between the cats. The scene is cozy and peaceful, with the cats resting comfortably on the couch.

Cats, couch, pink blanket, remotes, striped, white, cozy, peaceful
Output generated in 11.88s
Memory used: 8.24 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Molmo-7B-D-0924-bf16 
Fetching 18 files: 100%|█████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 15345.01it/s]
Failed to load model at mlx-community/Molmo-7B-D-0924-bf16: Expected shape (1024, 37888) but received shape (1024, 588) for parameter vision_tower.image_vit.patch_embedding.weight
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Phi-3.5-vision-instruct-bf16 
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 10123.65it/s]
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 17448.30it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|user|>
<|image_1|>Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<|end|>
<|assistant|>

Two cats, one with a striped coat and the other with a tabby coat, are sleeping on a pink couch. There are two remote controls on the couch, one on the left side and the other on the right side. The couch has a red cushion and a white cushion.<|end|>
==========
Prompt: 795 tokens, 925.052 tokens-per-sec
Generation: 70 tokens, 10.533 tokens-per-sec
Peak memory: 31.499 GB
Two cats, one with a striped coat and the other with a tabby coat, are sleeping on a pink couch. There are two remote controls on the couch, one on the left side and the other on the right side. The couch has a red cushion and a white cushion.<|end|>
Output generated in 8.26s
Memory used: 7.86 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/QVQ-72B-Preview-8bit 
Fetching 25 files: 100%|█████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 15545.97it/s]
Fetching 25 files: 100%|█████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 29102.86it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>system
You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.<|im_end|>
<|im_start|>user
Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant

Failed to generate output for model at mlx-community/QVQ-72B-Preview-8bit: arange(): incompatible function arguments. The following argument types are supported:
    1. arange(start : Union[int, float], stop : Union[int, float], step : Union[None, int, float], dtype: Optional[Dtype] = None, *, stream: Union[None, Stream, Device] = None) -> array
    2. arange(stop : Union[int, float], step : Union[None, int, float] = None, dtype: Optional[Dtype] = None, *, stream: Union[None, Stream, Device] = None) -> array

Invoked with types: mlx.core.array, kwargs = { dtype: mlx.core.Dtype }

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Qwen2-VL-7B-Instruct-8bit 
Fetching 12 files: 100%|████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 184365.01it/s]
Fetching 12 files: 100%|████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 189930.75it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant

Failed to generate output for model at mlx-community/Qwen2-VL-7B-Instruct-8bit: arange(): incompatible function arguments. The following argument types are supported:
    1. arange(start : Union[int, float], stop : Union[int, float], step : Union[None, int, float], dtype: Optional[Dtype] = None, *, stream: Union[None, Stream, Device] = None) -> array
    2. arange(stop : Union[int, float], step : Union[None, int, float] = None, dtype: Optional[Dtype] = None, *, stream: Union[None, Stream, Device] = None) -> array

Invoked with types: mlx.core.array, kwargs = { dtype: mlx.core.Dtype }

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/SmolVLM-Instruct-bf16 
Fetching 12 files: 100%|████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 105296.33it/s]
Some kwargs in processor config are unused and will not have any effect: image_seq_len. 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 36028.38it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>User:<image>Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<end_of_utterance>
Assistant:
 Two cats sleeping on a pink blanket with a remote control.
==========
Prompt: 1217 tokens, 1081.130 tokens-per-sec
Generation: 13 tokens, 125.429 tokens-per-sec
Peak memory: 86.818 GB
 Two cats sleeping on a pink blanket with a remote control.
Output generated in 3.45s
Memory used: 4.30 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/deepseek-vl2-8bit 
Fetching 13 files: 100%|██████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 9263.67it/s]
Some kwargs in processor config are unused and will not have any effect: patch_size, candidate_resolutions, add_special_token, pad_token, ignore_id, image_mean, mask_prompt, sft_format, image_token, downsample_ratio, normalize, image_std. 
Add pad token = ['<|▁pad▁|>'] to the tokenizer
<|▁pad▁|>:2
Add image token = ['<image>'] to the tokenizer
<image>:128815
Added grounding-related tokens
Added chat tokens
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 12146.57it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|User|>: <image>
Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.

<|Assistant|>:
The image shows two tabby cats lying on a pink surface. One cat is lying on its side, while the other is lying on its side with its head on the surface. Near the top left of the image, there is a remote control. The setting appears to show a cozy, indoor environment.
==========
Prompt: 1056 tokens, 375.070 tokens-per-sec
Generation: 62 tokens, 52.335 tokens-per-sec
Peak memory: 86.818 GB
The image shows two tabby cats lying on a pink surface. One cat is lying on its side, while the other is lying on its side with its head on the surface. Near the top left of the image, there is a remote control. The setting appears to show a cozy, indoor environment.
Output generated in 6.43s
Memory used: 27.38 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/dolphin-vision-72b-4bit 
Fetching 19 files: 100%|██████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 8378.03it/s]
Failed to load model at mlx-community/dolphin-vision-72b-4bit: TextConfig.__init__() missing 7 required positional arguments: 'model_type', 'hidden_size', 'num_hidden_layers', 'intermediate_size', 'num_attention_heads', 'rms_norm_eps', and 'vocab_size'
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/idefics2-8b-chatty-8bit 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 10388.37it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 19321.17it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: User: Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<image><end_of_utterance>
Assistant:
In the tranquil setting of a living room, two feline companions find solace on a vibrant pink couch. The cat on the left, a striking tabby with a coat of brown and black stripes, lies in repose. Its head is gently resting on the arm of the couch, embodying the epitome of relaxation. 

On the right, a gray and white cat is also at ease. Its head is comfortably nestled on the arm of the couch, mirroring its tabby companion's posture. The couch, with its soft pink hue, provides a stark contrast to the cats' fur, making them stand out in the scene.

Adding an element of humor to this peaceful tableau are two remote controls. One is located near the tabby's head, while the other is closer to the gray and white cat. Their presence suggests a recent television viewing session, now interrupted by the cats' nap.

This image captures a moment of serenity and companionship between two cats, set against the backdrop of a typical living room scene. The comma-separated keywords or tags for this image could include: "tabby cat", "gray and white
==========
Prompt: 105 tokens, 254.823 tokens-per-sec
Generation: 256 tokens, 49.101 tokens-per-sec
Peak memory: 86.818 GB
In the tranquil setting of a living room, two feline companions find solace on a vibrant pink couch. The cat on the left, a striking tabby with a coat of brown and black stripes, lies in repose. Its head is gently resting on the arm of the couch, embodying the epitome of relaxation. 

On the right, a gray and white cat is also at ease. Its head is comfortably nestled on the arm of the couch, mirroring its tabby companion's posture. The couch, with its soft pink hue, provides a stark contrast to the cats' fur, making them stand out in the scene.

Adding an element of humor to this peaceful tableau are two remote controls. One is located near the tabby's head, while the other is closer to the gray and white cat. Their presence suggests a recent television viewing session, now interrupted by the cats' nap.

This image captures a moment of serenity and companionship between two cats, set against the backdrop of a typical living room scene. The comma-separated keywords or tags for this image could include: "tabby cat", "gray and white
Output generated in 6.48s
Memory used: 8.30 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/llava-v1.6-34b-8bit 
Fetching 17 files: 100%|██████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 7407.35it/s]
Fetching 17 files: 100%|█████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 11410.33it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <|im_start|>user
<image>
Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<|im_end|>
<|im_start|>assistant

Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.50.
Caption: Two cats sleeping on a pink blanket.

Description: In the image, there are two cats lying on a pink blanket. The cat on the left is a tabby cat with black stripes, while the cat on the right is a calico cat with a mix of black, orange, and white fur. Both cats are sleeping peacefully, with their eyes closed and their bodies relaxed. The pink blanket provides a soft and comfortable surface for the cats to rest on.

Keywords: cats, sleeping, pink blanket, tabby, calico, stripes, fur, relaxation, comfort, peaceful, rest, feline, pet, domestic, animal, cozy, warm, soft, home, indoor, tranquility, serenity, contentment, happiness, well-being, restful, tranquil, serene, content, happy, well-rested, well-being, well-rested, well-cared-for, well-loved, well-taken-care-of, well-taken-care-of, well-taken-care-of, well-taken-care-of, well-taken-care-of, well-taken-care-of, well-
==========
Prompt: 37 tokens, 8.961 tokens-per-sec
Generation: 256 tokens, 10.336 tokens-per-sec
Peak memory: 86.818 GB
Caption: Two cats sleeping on a pink blanket.

Description: In the image, there are two cats lying on a pink blanket. The cat on the left is a tabby cat with black stripes, while the cat on the right is a calico cat with a mix of black, orange, and white fur. Both cats are sleeping peacefully, with their eyes closed and their bodies relaxed. The pink blanket provides a soft and comfortable surface for the cats to rest on.

Keywords: cats, sleeping, pink blanket, tabby, calico, stripes, fur, relaxation, comfort, peaceful, rest, feline, pet, domestic, animal, cozy, warm, soft, home, indoor, tranquility, serenity, contentment, happiness, well-being, restful, tranquil, serene, content, happy, well-rested, well-being, well-rested, well-cared-for, well-loved, well-taken-care-of, well-taken-care-of, well-taken-care-of, well-taken-care-of, well-taken-care-of, well-taken-care-of, well-
Output generated in 29.71s
Memory used: 34.27 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
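
The keyword list in the run above ends in a repeated `well-taken-care-of` loop and a cut-off `well-`, which looks like the model hitting the 256-token limit while repeating itself. For keywording purposes such output could be tidied after generation. What follows is a minimal sketch only: `clean_keywords` and its `max_keywords` parameter are hypothetical names, not part of mlx-vlm, and the rule of dropping any entry that ends in `-` is an assumption based on this one example.

```python
def clean_keywords(raw, max_keywords=30):
    """Split a comma-separated keyword string, dropping repeats and cut-off fragments."""
    parts = [p.strip().strip(".") for p in raw.split(",")]
    seen = set()
    result = []
    for part in parts:
        key = part.lower()
        # Skip empty entries, case-insensitive repeats, and fragments that
        # end in a hyphen (assumed to be truncated by the token limit, e.g. "well-").
        if not key or key in seen or key.endswith("-"):
            continue
        seen.add(key)
        result.append(part)
    return result[:max_keywords]


if __name__ == "__main__":
    raw = "cats, sleeping, well-rested, well-being, well-rested, well-taken-care-of, well-taken-care-of, well-"
    print(clean_keywords(raw))
```

This only removes exact repeats; near-synonyms such as `tranquil` and `tranquility` are left alone.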

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/llava-v1.6-mistral-7b-8bit 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 66664.43it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 24432.84it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: [INST] <image>
Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily. [/INST]
Caption: Two cats lounging on a pink blanket.

Description: The image shows two cats lying on a pink blanket. The cat on the left is a kitten with a gray and white striped coat, while the cat on the right is a larger cat with a brown and black tabby pattern. Both cats appear to be sleeping or resting, with their bodies stretched out on the blanket. In the background, there is a red couch and a remote control.

Keywords or tags: cats, kitten, tabby, striped, blanket, couch, remote control, sleeping, resting, home, indoor, pets. 
==========
Prompt: 42 tokens, 38.654 tokens-per-sec
Generation: 133 tokens, 48.147 tokens-per-sec
Peak memory: 86.818 GB
Caption: Two cats lounging on a pink blanket.

Description: The image shows two cats lying on a pink blanket. The cat on the left is a kitten with a gray and white striped coat, while the cat on the right is a larger cat with a brown and black tabby pattern. Both cats appear to be sleeping or resting, with their bodies stretched out on the blanket. In the background, there is a red couch and a remote control.

Keywords or tags: cats, kitten, tabby, striped, blanket, couch, remote control, sleeping, resting, home, indoor, pets. 
Output generated in 4.40s
Memory used: 6.42 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-10b-ft-docci-448-6bit 
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 12965.39it/s]
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 10827.50it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <image>Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.
A top-down view of two brown and black striped cats laying on a pink blanket. The cat on the left is laying on its side with its head facing the top left corner of the image. The cat on the right is laying on its side with its head facing the bottom right corner of the image. There is a gray remote control on the left side of the image and a gray remote control on the right side of the image. There is a red couch visible in the top left corner of the image.
==========
Prompt: 1051 tokens, 504.660 tokens-per-sec
Generation: 104 tokens, 39.375 tokens-per-sec
Peak memory: 86.818 GB
A top-down view of two brown and black striped cats laying on a pink blanket. The cat on the left is laying on its side with its head facing the top left corner of the image. The cat on the right is laying on its side with its head facing the bottom right corner of the image. There is a gray remote control on the left side of the image and a gray remote control on the right side of the image. There is a red couch visible in the top left corner of the image.
Output generated in 5.30s
Memory used: 7.68 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-10b-ft-docci-448-bf16 
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 17008.53it/s]
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 20610.83it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <image>Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.
A top-down view of two brown and black striped cats laying on a pink blanket. The cat on the left is laying on its side with its head facing the top left corner of the image. The cat on the right is laying on its side with its head facing the bottom right corner of the image. There is a gray remote control on the left side of the image and a gray remote control on the right side of the image. There is a red couch visible in the top left corner of the image.
==========
Prompt: 1051 tokens, 474.687 tokens-per-sec
Generation: 104 tokens, 4.711 tokens-per-sec
Peak memory: 86.818 GB
A top-down view of two brown and black striped cats laying on a pink blanket. The cat on the left is laying on its side with its head facing the top left corner of the image. The cat on the right is laying on its side with its head facing the bottom right corner of the image. There is a gray remote control on the left side of the image and a gray remote control on the right side of the image. There is a red couch visible in the top left corner of the image.
Output generated in 24.84s
Memory used: 18.05 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-3b-ft-docci-448-bf16 
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 29330.80it/s]
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 19043.38it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <image>Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.
A top-down view of two cats sleeping on a pink blanket. The cat on the left is a gray and black tabby cat, and it is lying on its side with its head facing the right and its tail sticking out to the left. Its front paws are stretched out in front of it. The cat on the right is lying on its side, and its head is down. It is brown and black and has its head turned to the right. Its front paws are stretched out in front of it. There are two white and blue remote controls between the cats.
==========
Prompt: 1051 tokens, 1291.438 tokens-per-sec
Generation: 115 tokens, 16.631 tokens-per-sec
Peak memory: 86.818 GB
A top-down view of two cats sleeping on a pink blanket. The cat on the left is a gray and black tabby cat, and it is lying on its side with its head facing the right and its tail sticking out to the left. Its front paws are stretched out in front of it. The cat on the right is lying on its side, and its head is down. It is brown and black and has its head turned to the right. Its front paws are stretched out in front of it. There are two white and blue remote controls between the cats.
Output generated in 8.29s
Memory used: 5.27 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-3b-pt-896-4bit 
Fetching 7 files: 100%|████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 15973.95it/s]
Fetching 7 files: 100%|████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 43755.78it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <image>Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.
cat
==========
Prompt: 4123 tokens, 1278.715 tokens-per-sec
Generation: 2 tokens, 66.448 tokens-per-sec
Peak memory: 86.818 GB
cat
Output generated in 3.99s
Memory used: 1.64 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/pixtral-12b-8bit 
Fetching 11 files: 100%|█████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 16194.22it/s]
Fetching 11 files: 100%|█████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 13299.90it/s]
==========
Image: ['http://images.cocodataset.org/val2017/000000039769.jpg'] 

Prompt: <s>[INST][IMG]Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.[/INST]
**Caption:** Two Bengal cats lounging on a pink couch with remote controls.

**Description:** The image features two Bengal cats resting on a pink couch. One cat is lying on its back with its paws up in the air, while the other is lying on its side with its head resting on the armrest. Two remote controls are placed on the couch near the cats.

**Keywords:** Bengal cats, pink couch, lounging, remote controls, relaxed, armrest, paws up, side lying, domestic, indoor, comfort, feline, pets.
==========
Prompt: 1259 tokens, 424.747 tokens-per-sec
Generation: 117 tokens, 28.747 tokens-per-sec
Peak memory: 86.818 GB
**Caption:** Two Bengal cats lounging on a pink couch with remote controls.

**Description:** The image features two Bengal cats resting on a pink couch. One cat is lying on its back with its paws up in the air, while the other is lying on its side with its head resting on the armrest. Two remote controls are placed on the couch near the cats.

**Keywords:** Bengal cats, pink couch, lounging, remote controls, relaxed, armrest, paws up, side lying, domestic, indoor, comfort, feline, pets.
Output generated in 7.71s
Memory used: 12.58 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
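
To compare runs like the ones above without reading every block, the per-model figures can be scraped from the log text. This is a rough sketch rather than anything mlx-vlm provides: `summarise` is a hypothetical helper, and the regular expressions are inferred from the lines printed in this thread (`Running ...`, `Prompt: N tokens, ...`, `Generation: ...`, `Memory used: ...`, `Failed to ...`), so other versions may need different patterns.

```python
import re

# Patterns inferred from the log lines in this thread (an assumption, not a documented format).
RUN_RE = re.compile(r"^Running (\S+)")
PROMPT_RE = re.compile(r"^Prompt: (\d+) tokens, ([\d.]+) tokens-per-sec")
GEN_RE = re.compile(r"^Generation: (\d+) tokens, ([\d.]+) tokens-per-sec")
MEM_RE = re.compile(r"^Memory used: ([\d.]+) GB")
FAIL_RE = re.compile(r"^Failed to (load|generate)")


def summarise(log_text):
    """Return one dict per 'Running <model>' section found in the log."""
    rows = []
    current = None
    for line in log_text.splitlines():
        line = line.strip()
        m = RUN_RE.match(line)
        if m:
            # A new model section starts; assume success until a failure line appears.
            current = {"model": m.group(1), "status": "ok"}
            rows.append(current)
            continue
        if current is None:
            continue
        if FAIL_RE.match(line):
            current["status"] = "failed"
        elif (m := PROMPT_RE.match(line)):
            current["prompt_tps"] = float(m.group(2))
        elif (m := GEN_RE.match(line)):
            current["gen_tokens"] = int(m.group(1))
            current["gen_tps"] = float(m.group(2))
        elif (m := MEM_RE.match(line)):
            current["memory_gb"] = float(m.group(1))
    return rows


if __name__ == "__main__":
    import sys
    for row in summarise(sys.stdin.read()):
        print(row)
```

In the runs above, a `gen_tokens` value of 256 coincides with output that stops mid-sentence, so it may be a useful hint that the token limit was reached.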

@jrp2014
Author

jrp2014 commented Jan 2, 2025

Today's run, with the versions listed below:

python check_models.py
mlx version: 0.21.1.dev20241231+8ecdfb718
mlx-vlm version: 0.1.9
The most recently modified file is: /Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running HuggingFaceTB/SmolVLM-Instruct
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 24093.66it/s]
Some kwargs in processor config are unused and will not have any effect: image_seq_len. 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 20368.94it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: <|im_start|>User:<image>Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<end_of_utterance>
Assistant:
 A gingerbread house sits on a table. The house is decorated with candy and has a roof made of white icing. There are two lit sparklers on the roof.
==========
Prompt: 1582 tokens, 1100.960 tokens-per-sec
Generation: 36 tokens, 119.271 tokens-per-sec
Peak memory: 6.473 GB
 A gingerbread house sits on a table. The house is decorated with candy and has a roof made of white icing. There are two lit sparklers on the roof.
Output generated in 2.16s
Memory used: 5.10 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running OpenGVLab/InternVL2_5-8B
Fetching 21 files: 100%|█████████████████████████████████████████████████████████████| 21/21 [00:00<00:00, 11476.27it/s]
ERROR:root:Model type internvl_chat not supported.
Failed to load model or config at OpenGVLab/InternVL2_5-8B: Model type internvl_chat not supported.
================================================================================

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running cognitivecomputations/dolphin-2.9.2-qwen2-72b
Fetching 40 files: 100%|██████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 6974.81it/s]
ERROR:root:Model type qwen2 not supported.
Failed to load model or config at cognitivecomputations/dolphin-2.9.2-qwen2-72b: Model type qwen2 not supported.
================================================================================

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running distilbert/distilbert-base-uncased-finetuned-sst-2-english
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 13640.01it/s]
ERROR:root:Model type distilbert not supported.
Failed to load model or config at distilbert/distilbert-base-uncased-finetuned-sst-2-english: Model type distilbert not supported.
================================================================================

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running google/siglip-so400m-patch14-384
Fetching 6 files: 100%|█████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 8510.59it/s]
ERROR:root:Model type siglip not supported.
Failed to load model or config at google/siglip-so400m-patch14-384: Model type siglip not supported.
================================================================================

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running meta-llama/Llama-3.2-11B-Vision-Instruct
Fetching 15 files: 100%|██████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 8767.36it/s]
Fetching 15 files: 100%|█████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 24461.34it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>


The image depicts a gingerbread house with a sparkler on top, sitting on a wooden table. The gingerbread house is decorated with various candies and sweets, including gumdrops, candy canes, and sprinkles.

**Description:**
The gingerbread house is made of gingerbread and has a rectangular base and a triangular roof. It is decorated with a variety of candies and sweets, including gumdrops, candy canes, and sprinkles. The house is placed on a wooden table, which is made of dark wood.

**Keywords:**
gingerbread house, sparkler, candy, sweets, gumdrops, candy canes, sprinkles, wooden table, dark wood, festive, holiday, Christmas, winter, sweet, colorful, decorative, edible, food, dessert, treat, celebration, party, special occasion, holiday season, winter wonderland, festive atmosphere, joyful, happy, fun, playful, creative, artistic, decorative, edible art, sweet treats, holiday treats, Christmas treats, winter treats, festive treats, holiday sweets, winter sweets, sweet delights, holiday delights, winter delights, festive delights, holiday joy, winter joy, happy holidays, joyful holidays, fun holidays, playful holidays, creative holidays, artistic holidays, decorative holidays, edible holidays
==========
Prompt: 36 tokens, 10.782 tokens-per-sec
Generation: 256 tokens, 3.695 tokens-per-sec
Peak memory: 31.499 GB
The image depicts a gingerbread house with a sparkler on top, sitting on a wooden table. The gingerbread house is decorated with various candies and sweets, including gumdrops, candy canes, and sprinkles.

**Description:**
The gingerbread house is made of gingerbread and has a rectangular base and a triangular roof. It is decorated with a variety of candies and sweets, including gumdrops, candy canes, and sprinkles. The house is placed on a wooden table, which is made of dark wood.

**Keywords:**
gingerbread house, sparkler, candy, sweets, gumdrops, candy canes, sprinkles, wooden table, dark wood, festive, holiday, Christmas, winter, sweet, colorful, decorative, edible, food, dessert, treat, celebration, party, special occasion, holiday season, winter wonderland, festive atmosphere, joyful, happy, fun, playful, creative, artistic, decorative, edible art, sweet treats, holiday treats, Christmas treats, winter treats, festive treats, holiday sweets, winter sweets, sweet delights, holiday delights, winter delights, festive delights, holiday joy, winter joy, happy holidays, joyful holidays, fun holidays, playful holidays, creative holidays, artistic holidays, decorative holidays, edible holidays
Output generated in 72.90s
Memory used: 18.42 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running microsoft/Florence-2-large-ft
Fetching 9 files: 100%|████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 18874.37it/s]
ERROR:root:No safetensors found in /Users/jrp/.cache/huggingface/hub/models--microsoft--Florence-2-large-ft/snapshots/bb44b80c15e943b1bf7cec6e076359cec6e40178
Failed to load model or config at microsoft/Florence-2-large-ft: 
No safetensors found in /Users/jrp/.cache/huggingface/hub/models--microsoft--Florence-2-large-ft/snapshots/bb44b80c15e943b1bf7cec6e076359cec6e40178
Create safetensors using the following code:

from transformers import AutoModelForCausalLM, AutoProcessor

model_id= "<huggingface_model_id>"
model = AutoModelForCausalLM.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

model.save_pretrained("<local_dir>")
processor.save_pretrained("<local_dir>")

Then use the <local_dir> as the --hf-path in the convert script.

python -m mlx_vlm.convert --hf-path <local_dir> --mlx-path <mlx_dir>

        
================================================================================

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running microsoft/Phi-3.5-mini-instruct
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 19431.91it/s]
ERROR:root:Model type phi3 not supported.
Failed to load model or config at microsoft/Phi-3.5-mini-instruct: Model type phi3 not supported.
================================================================================

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running microsoft/Phi-3.5-vision-instruct
Fetching 14 files: 100%|█████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 26462.49it/s]
/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:524: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
Fetching 14 files: 100%|█████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 16849.43it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: <|user|>
<|image_1|>Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<|end|>
<|assistant|>

Gingerbread house with sparklers, festive decorations, wooden table, indoor setting, holiday celebration, Christmas, New Year's Eve, sparkling lights, festive atmosphere, holiday spirit, decorative elements, gingerbread cookies, festive treats, holiday tradition, celebratory moment, holiday season, festive decoration, holiday cheer.<|end|>
==========
Prompt: 795 tokens, 889.567 tokens-per-sec
Generation: 85 tokens, 10.156 tokens-per-sec
Peak memory: 31.499 GB
Gingerbread house with sparklers, festive decorations, wooden table, indoor setting, holiday celebration, Christmas, New Year's Eve, sparkling lights, festive atmosphere, holiday spirit, decorative elements, gingerbread cookies, festive treats, holiday tradition, celebratory moment, holiday season, festive decoration, holiday cheer.<|end|>
Output generated in 9.52s
Memory used: 7.47 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mistral-community/pixtral-12b
Fetching 15 files: 100%|█████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 28250.81it/s]
Failed to load model or config at mistral-community/pixtral-12b: Unsupported model type: pixtral
================================================================================

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Florence-2-large-ft-bf16
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 40233.13it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 31107.32it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.
<s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s>
==========
Prompt: 29 tokens, 102.238 tokens-per-sec
Generation: 256 tokens, 169.543 tokens-per-sec
Peak memory: 31.499 GB
<s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s>
Output generated in 2.13s
Memory used: 1.68 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Llama-3.2-11B-Vision-Instruct-8bit
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 17098.67it/s]
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 46040.66it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>


The image depicts a gingerbread house adorned with an assortment of candies and sweets, including gumdrops, candy canes, and chocolate sticks. The house is situated on a wooden table, and a sparkler is placed on top of it, adding a festive touch.

**Description:** A gingerbread house decorated with various candies and sweets, placed on a wooden table with a sparkler on top.

**Keywords:** Gingerbread house, Christmas, holiday, festive, candy, sweets, sparkler, wooden table, festive decoration.
==========
Prompt: 35 tokens, 10.807 tokens-per-sec
Generation: 106 tokens, 8.832 tokens-per-sec
Peak memory: 31.499 GB
The image depicts a gingerbread house adorned with an assortment of candies and sweets, including gumdrops, candy canes, and chocolate sticks. The house is situated on a wooden table, and a sparkler is placed on top of it, adding a festive touch.

**Description:** A gingerbread house decorated with various candies and sweets, placed on a wooden table with a sparkler on top.

**Keywords:** Gingerbread house, Christmas, holiday, festive, candy, sweets, sparkler, wooden table, festive decoration.
Output generated in 15.55s
Memory used: 10.79 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Llama-3.3-70B-Instruct-8bit
Fetching 20 files: 100%|█████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 20641.26it/s]
ERROR:root:Model type llama not supported.
Failed to load model or config at mlx-community/Llama-3.3-70B-Instruct-8bit: Model type llama not supported.
================================================================================

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Molmo-7B-D-0924-8bit
Fetching 16 files: 100%|█████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 27492.37it/s]
Fetching 16 files: 100%|█████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 17076.05it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.
 A festive gingerbread house adorned with colorful decorations sits on a wooden table, complete with a miniature Christmas tree and candy border. The house features a roof covered in red and white icing, with "Merry Christmas" written in red. Fireworks explode above, adding a magical touch to the holiday scene. The image captures the warmth and joy of Christmas celebrations.

Gingerbread house, wooden table, Christmas decorations, candy, fireworks, festive scene, holiday spirit, red and white, "Merry Christmas"
==========
Prompt: 1001 tokens, 49.859 tokens-per-sec
Generation: 103 tokens, 41.161 tokens-per-sec
Peak memory: 31.499 GB
 A festive gingerbread house adorned with colorful decorations sits on a wooden table, complete with a miniature Christmas tree and candy border. The house features a roof covered in red and white icing, with "Merry Christmas" written in red. Fireworks explode above, adding a magical touch to the holiday scene. The image captures the warmth and joy of Christmas celebrations.

Gingerbread house, wooden table, Christmas decorations, candy, fireworks, festive scene, holiday spirit, red and white, "Merry Christmas"
Output generated in 22.93s
Memory used: 9.05 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Molmo-7B-D-0924-bf16
Fetching 18 files: 100%|█████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 13252.15it/s]
Fetching 18 files: 100%|█████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 25609.73it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.
 A festive gingerbread house with a colorful roof and decorated windows sits on a wooden table, surrounded by candy canes and trees. Fireworks explode above, creating a magical holiday scene. The house is adorned with various candies and sprinkles, adding to its whimsical charm. The warm glow of the fireworks contrasts beautifully with the dark background, creating a captivating winter wonderland atmosphere.

Gingerbread house, fireworks, holiday, decorations, candies, wooden table, winter wonderland, warm glow, contrast, dark background, festive, magical, charming, colorful, windows, roof, candy canes, trees
==========
Prompt: 1001 tokens, 48.372 tokens-per-sec
Generation: 122 tokens, 26.152 tokens-per-sec
Peak memory: 36.669 GB
 A festive gingerbread house with a colorful roof and decorated windows sits on a wooden table, surrounded by candy canes and trees. Fireworks explode above, creating a magical holiday scene. The house is adorned with various candies and sprinkles, adding to its whimsical charm. The warm glow of the fireworks contrasts beautifully with the dark background, creating a captivating winter wonderland atmosphere.

Gingerbread house, fireworks, holiday, decorations, candies, wooden table, winter wonderland, warm glow, contrast, dark background, festive, magical, charming, colorful, windows, roof, candy canes, trees
Output generated in 25.67s
Memory used: 11.69 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Phi-3.5-vision-instruct-bf16
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 34100.03it/s]
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 32017.59it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: <|user|>
<|image_1|>Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<|end|>
<|assistant|>

Gingerbread house with sparklers, festive decorations, wooden table, indoor setting, holiday celebration, Christmas, New Year's Eve, sparkling lights, festive atmosphere, holiday spirit, decorative elements, gingerbread cookies, festive treats, holiday tradition, celebratory moment, holiday season, festive decoration, holiday cheer.<|end|>
==========
Prompt: 795 tokens, 916.819 tokens-per-sec
Generation: 85 tokens, 10.175 tokens-per-sec
Peak memory: 36.669 GB
Gingerbread house with sparklers, festive decorations, wooden table, indoor setting, holiday celebration, Christmas, New Year's Eve, sparkling lights, festive atmosphere, holiday spirit, decorative elements, gingerbread cookies, festive treats, holiday tradition, celebratory moment, holiday season, festive decoration, holiday cheer.<|end|>
Output generated in 9.45s
Memory used: 7.80 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/QVQ-72B-Preview-8bit
Fetching 25 files: 100%|█████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 18315.74it/s]
Fetching 25 files: 100%|█████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 14166.12it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: <|im_start|>system
You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.<|im_end|>
<|im_start|>user
Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant

Failed to generate output for model at mlx-community/QVQ-72B-Preview-8bit: arange(): incompatible function arguments. The following argument types are supported:
    1. arange(start : Union[int, float], stop : Union[int, float], step : Union[None, int, float], dtype: Optional[Dtype] = None, *, stream: Union[None, Stream, Device] = None) -> array
    2. arange(stop : Union[int, float], step : Union[None, int, float] = None, dtype: Optional[Dtype] = None, *, stream: Union[None, Stream, Device] = None) -> array

Invoked with types: mlx.core.array, kwargs = { dtype: mlx.core.Dtype }
********************************************************************************

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Qwen2-VL-7B-Instruct-8bit
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 20205.40it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 46006.99it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant

Failed to generate output for model at mlx-community/Qwen2-VL-7B-Instruct-8bit: arange(): incompatible function arguments. The following argument types are supported:
    1. arange(start : Union[int, float], stop : Union[int, float], step : Union[None, int, float], dtype: Optional[Dtype] = None, *, stream: Union[None, Stream, Device] = None) -> array
    2. arange(stop : Union[int, float], step : Union[None, int, float] = None, dtype: Optional[Dtype] = None, *, stream: Union[None, Stream, Device] = None) -> array

Invoked with types: mlx.core.array, kwargs = { dtype: mlx.core.Dtype }
********************************************************************************

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/SmolVLM-Instruct-bf16
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 23442.78it/s]
Some kwargs in processor config are unused and will not have any effect: image_seq_len. 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 26146.31it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: <|im_start|>User:<image>Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<end_of_utterance>
Assistant:
 A gingerbread house sits on a table. The house is decorated with candy and has a roof made of white icing. There are two lit sparklers on the roof.
==========
Prompt: 1582 tokens, 1074.756 tokens-per-sec
Generation: 36 tokens, 102.004 tokens-per-sec
Peak memory: 86.818 GB
 A gingerbread house sits on a table. The house is decorated with candy and has a roof made of white icing. There are two lit sparklers on the roof.
Output generated in 2.27s
Memory used: 0.98 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/deepseek-vl2-8bit
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 32340.42it/s]
Some kwargs in processor config are unused and will not have any effect: candidate_resolutions, image_token, pad_token, ignore_id, normalize, downsample_ratio, add_special_token, image_mean, sft_format, patch_size, mask_prompt, image_std. 
Add pad token = ['<|▁pad▁|>'] to the tokenizer
<|▁pad▁|>:2
Add image token = ['<image>'] to the tokenizer
<image>:128815
Added grounding-related tokens
Added chat tokens
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 44186.35it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: <|User|>: <image>
Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.

<|Assistant|>:
Failed to generate output for model at mlx-community/deepseek-vl2-8bit: 
********************************************************************************

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/dolphin-vision-72b-4bit
Fetching 19 files: 100%|█████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 63958.09it/s]
Failed to load model or config at mlx-community/dolphin-vision-72b-4bit: TextConfig.__init__() missing 7 required positional arguments: 'model_type', 'hidden_size', 'num_hidden_layers', 'intermediate_size', 'num_attention_heads', 'rms_norm_eps', and 'vocab_size'
================================================================================

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/idefics2-8b-chatty-8bit
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 36604.83it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 41699.79it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: User: Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<image><end_of_utterance>
Assistant:
This image captures a delightful scene of a gingerbread house, adorned with a variety of colorful candies and sprinkles. The house, constructed from gingerbread, is topped with a chimney and a roof, adding to its charm. A sparkler is placed on the roof, casting a warm glow and creating a festive atmosphere. The house is situated on a wooden table, with a blurred background that draws focus to the gingerbread house. The image exudes a sense of celebration and joy, making it an ideal representation of holiday cheer.

Keywords: Gingerbread house, candies, sprinkles, sparkler, festive, holiday, celebration, joy, warmth, charm, wooden table, background, glow, atmosphere, festive, holiday, celebration, joy, warmth, charm.<end_of_utterance>
==========
Prompt: 105 tokens, 81.562 tokens-per-sec
Generation: 177 tokens, 48.203 tokens-per-sec
Peak memory: 86.818 GB
This image captures a delightful scene of a gingerbread house, adorned with a variety of colorful candies and sprinkles. The house, constructed from gingerbread, is topped with a chimney and a roof, adding to its charm. A sparkler is placed on the roof, casting a warm glow and creating a festive atmosphere. The house is situated on a wooden table, with a blurred background that draws focus to the gingerbread house. The image exudes a sense of celebration and joy, making it an ideal representation of holiday cheer.

Keywords: Gingerbread house, candies, sprinkles, sparkler, festive, holiday, celebration, joy, warmth, charm, wooden table, background, glow, atmosphere, festive, holiday, celebration, joy, warmth, charm.<end_of_utterance>
Output generated in 5.29s
Memory used: 8.44 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/llava-v1.6-34b-8bit
Fetching 17 files: 100%|█████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 19402.22it/s]
Fetching 17 files: 100%|█████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 15330.72it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: <|im_start|>user
<image>
Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.<|im_end|>
<|im_start|>assistant

Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.50.
Caption: A gingerbread house with a sparkler on top.

Description: The gingerbread house is decorated with candy and icing. It has a sparkler on top that is lit, creating a festive atmosphere.

Keywords: gingerbread house, sparkler, candy, icing, festive, holiday, decoration, dessert, cake, sweet, confectionery, edible, celebration, Christmas, winter, joy, light, warmth, cozy, home, family, tradition, sugar, sweet tooth, treat, indulgence, delicious, mouth-watering, appetizing, delectable, delightful, charming, whimsical, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful
==========
Prompt: 37 tokens, 8.758 tokens-per-sec
Generation: 256 tokens, 9.996 tokens-per-sec
Peak memory: 86.818 GB
Caption: A gingerbread house with a sparkler on top.

Description: The gingerbread house is decorated with candy and icing. It has a sparkler on top that is lit, creating a festive atmosphere.

Keywords: gingerbread house, sparkler, candy, icing, festive, holiday, decoration, dessert, cake, sweet, confectionery, edible, celebration, Christmas, winter, joy, light, warmth, cozy, home, family, tradition, sugar, sweet tooth, treat, indulgence, delicious, mouth-watering, appetizing, delectable, delightful, charming, whimsical, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful, delightful
Output generated in 30.25s
Memory used: 34.33 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/llava-v1.6-mistral-7b-8bit
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 46517.23it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 18065.92it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: [INST] <image>
Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily. [/INST]
Factual Caption: A gingerbread house decorated with colorful candies and lit up with sparklers.

Description: The image showcases a festive gingerbread house, meticulously crafted and adorned with a variety of colorful candies. The house is topped with a chimney, and the roof is embellished with red and green candies. The walls of the house are decorated with white icing, and the windows are embellished with green candies. The house is placed on a wooden surface, and a string of sparklers is attached to the top, adding a touch of sparkle and festivity to the scene.

Keywords or Tags: gingerbread house, holiday decoration, festive, sparklers, candies, gingerbread, Christmas, holiday, craft, decoration, icing, wooden surface, colorful, embellished, embellishment, festive decoration, holiday decoration, sparklers, candies, gingerbread, Christmas, holiday, craft, decoration, icing, wooden surface, colorful, embellished, embellishment, festive decoration, holiday decoration, sparklers, candies, ginger
==========
Prompt: 42 tokens, 36.231 tokens-per-sec
Generation: 256 tokens, 46.937 tokens-per-sec
Peak memory: 86.818 GB
Factual Caption: A gingerbread house decorated with colorful candies and lit up with sparklers.

Description: The image showcases a festive gingerbread house, meticulously crafted and adorned with a variety of colorful candies. The house is topped with a chimney, and the roof is embellished with red and green candies. The walls of the house are decorated with white icing, and the windows are embellished with green candies. The house is placed on a wooden surface, and a string of sparklers is attached to the top, adding a touch of sparkle and festivity to the scene.

Keywords or Tags: gingerbread house, holiday decoration, festive, sparklers, candies, gingerbread, Christmas, holiday, craft, decoration, icing, wooden surface, colorful, embellished, embellishment, festive decoration, holiday decoration, sparklers, candies, gingerbread, Christmas, holiday, craft, decoration, icing, wooden surface, colorful, embellished, embellishment, festive decoration, holiday decoration, sparklers, candies, ginger
Output generated in 7.05s
Memory used: 5.59 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-10b-ft-docci-448-6bit
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 15477.14it/s]
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 21156.64it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: <image>Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.
A close-up view of a small gingerbread house placed on a dark brown wooden table. The gingerbread house is facing the left side of the image. The roof of the house is pointed, there is a white icing border going around the edge of the roof. There are red and white icing lines going across the roof. There are small white icing stars placed on the roof as well. There are small green, red, and blue icing stars placed on the roof as well. There is a white icing border going around the edge of the house. There are small white icing stars placed on the front of the house. There are small green icing trees placed on the front of the house. There is a white icing border going around the edge of the base of the house. There are small white icing stars placed on the base of the house. There are small green, red, and blue icing stars placed on the base of the house. There are small white icing stars placed on the base of the house. There is a white icing border going around the edge of the base of the house. There are small white icing stars placed on the base of the house. There is a white icing border going around the edge of the base of the house. There are small white icing stars placed
==========
Prompt: 1051 tokens, 485.065 tokens-per-sec
Generation: 256 tokens, 38.642 tokens-per-sec
Peak memory: 86.818 GB
A close-up view of a small gingerbread house placed on a dark brown wooden table. The gingerbread house is facing the left side of the image. The roof of the house is pointed, there is a white icing border going around the edge of the roof. There are red and white icing lines going across the roof. There are small white icing stars placed on the roof as well. There are small green, red, and blue icing stars placed on the roof as well. There is a white icing border going around the edge of the house. There are small white icing stars placed on the front of the house. There are small green icing trees placed on the front of the house. There is a white icing border going around the edge of the base of the house. There are small white icing stars placed on the base of the house. There are small green, red, and blue icing stars placed on the base of the house. There are small white icing stars placed on the base of the house. There is a white icing border going around the edge of the base of the house. There are small white icing stars placed on the base of the house. There is a white icing border going around the edge of the base of the house. There are small white icing stars placed
Output generated in 9.11s
Memory used: 7.43 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-10b-ft-docci-448-bf16
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 20252.55it/s]
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 17719.92it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: <image>Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.
A close-up view of a gingerbread house placed on a brown wooden table. The gingerbread house is facing the left side of the image. The roof of the house is pointed, there is a chimney on the left side of the house, and there is a window on the right side of the house. The house is decorated with white icing, candy, and sprinkles. There is a firework placed on top of the house. The firework is shooting sparks in all directions. The background of the image is blurry.
==========
Prompt: 1051 tokens, 471.758 tokens-per-sec
Generation: 103 tokens, 4.639 tokens-per-sec
Peak memory: 86.818 GB
A close-up view of a gingerbread house placed on a brown wooden table. The gingerbread house is facing the left side of the image. The roof of the house is pointed, there is a chimney on the left side of the house, and there is a window on the right side of the house. The house is decorated with white icing, candy, and sprinkles. There is a firework placed on top of the house. The firework is shooting sparks in all directions. The background of the image is blurry.
Output generated in 24.75s
Memory used: 17.97 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-3b-ft-docci-448-bf16
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 26337.86it/s]
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 35734.22it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: <image>Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.
A medium-close-up view of a gingerbread house that is being lit up by a firework. The house is made up of different-colored candies and is placed on a wooden table. The roof of the house is made up of white icing, and along the roof there are red and white candies that are shaped like snowflakes. The door of the house is made up of white icing, and along the door there are red and white circles that resemble windows. The house is being lit up by a firework that is shooting sparks in all directions. Behind the house, there is a blurry view of a couch that is made up of brown leather.
==========
Prompt: 1051 tokens, 1305.124 tokens-per-sec
Generation: 130 tokens, 16.214 tokens-per-sec
Peak memory: 86.818 GB
A medium-close-up view of a gingerbread house that is being lit up by a firework. The house is made up of different-colored candies and is placed on a wooden table. The roof of the house is made up of white icing, and along the roof there are red and white candies that are shaped like snowflakes. The door of the house is made up of white icing, and along the door there are red and white circles that resemble windows. The house is being lit up by a firework that is shooting sparks in all directions. Behind the house, there is a blurry view of a couch that is made up of brown leather.
Output generated in 9.14s
Memory used: 5.29 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-3b-pt-896-4bit
Fetching 7 files: 100%|████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 27568.20it/s]
Fetching 7 files: 100%|████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 11805.44it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: <image>Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.
christmas cake
==========
Prompt: 4123 tokens, 1285.691 tokens-per-sec
Generation: 3 tokens, 75.171 tokens-per-sec
Peak memory: 86.818 GB
christmas cake
Output generated in 3.59s
Memory used: 1.74 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/pixtral-12b-8bit
Fetching 11 files: 100%|████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 104857.60it/s]
Fetching 11 files: 100%|█████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 30985.46it/s]
==========
Image: ['/Users/jrp/Pictures/Processed/20250101-214540_DSC01741.jpg'] 

Prompt: <s>[INST][IMG]Provide a factual caption, description and comma-separated keywords or tags for this image so that it can be searched for easily.[/INST]
This image features a festive gingerbread house adorned with various colorful candies and decorations. The house is topped with a sparkler, which is emitting bright sparks, adding a celebratory touch. The intricate details include candy canes, pretzels, and an assortment of sweet treats, creating a vibrant and joyful holiday scene.

Caption: "A beautifully decorated gingerbread house with a sparkling top, celebrating the festive season with colorful candies and treats."

Keywords: gingerbread house, sparkler, candies, festive, holiday, decorations, celebration, treats, colorful, joyful, festive season
==========
Prompt: 4189 tokens, 353.647 tokens-per-sec
Generation: 128 tokens, 25.956 tokens-per-sec
Peak memory: 86.818 GB
This image features a festive gingerbread house adorned with various colorful candies and decorations. The house is topped with a sparkler, which is emitting bright sparks, adding a celebratory touch. The intricate details include candy canes, pretzels, and an assortment of sweet treats, creating a vibrant and joyful holiday scene.

Caption: "A beautifully decorated gingerbread house with a sparkling top, celebrating the festive season with colorful candies and treats."

Keywords: gingerbread house, sparkler, candies, festive, holiday, decorations, celebration, treats, colorful, joyful, festive season
Output generated in 17.13s
Memory used: 12.64 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
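
To keep track across runs, the per-model blocks above can be reduced to a small table. This is a minimal sketch, assuming the log format shown above (`Running <model>`, `Output generated in <n>s`, `Memory used: <n> GB`, and `Failed to ...` lines). `summarize_log` and its field names are hypothetical and are not part of check_models.py or mlx-vlm.

```python
import re

# Line patterns copied from the log format above; anything else is ignored.
RUNNING = re.compile(r"^Running (\S+)$")
ELAPSED = re.compile(r"^Output generated in ([\d.]+)s$")
MEMORY = re.compile(r"^Memory used: ([\d.]+) GB$")


def summarize_log(text):
    """Return one dict per model block: model, status, seconds, memory_gb."""
    rows = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if m := RUNNING.match(line):
            # A "Running <model>" line opens a new block.
            current = {"model": m.group(1), "status": "ok",
                       "seconds": None, "memory_gb": None}
            rows.append(current)
        elif current is None:
            continue  # preamble before the first model block
        elif m := ELAPSED.match(line):
            current["seconds"] = float(m.group(1))
        elif m := MEMORY.match(line):
            current["memory_gb"] = float(m.group(1))
        elif line.startswith("Failed to "):
            # Covers both "Failed to load model or config" and
            # "Failed to generate output for model".
            current["status"] = "failed"
    return rows


if __name__ == "__main__":
    import sys

    for row in summarize_log(sys.stdin.read()):
        print(row)
```

Piping a saved log into the script prints one row per model. A model that loads but never reports a time keeps `seconds` as `None`, so such rows are easy to spot.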

@jrp2014
Author

jrp2014 commented Jan 5, 2025

Image

This is a hard image because it is so nondescript: some models make a reasonable stab at it, while others just repeat nonsense. The mlx machinery is clearly working, as GPU usage is often pegged at 100%.

Llama-3.2-11B-Vision-Instruct-8bit is pretty good at describing the image, but generates some laughable keywords.

Molmo-7B-D-0924-8bit, idefics2-8b-chatty-8bit and pixtral-12b-8bit aren't bad.
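
One rough way to flag the runs that repeat nonsense is to measure how much of the output consists of repeated word pairs, and to de-duplicate keyword lists before using them. This is a minimal sketch in plain Python. `repetition_ratio`, `dedupe_keywords` and the 0.5 threshold are hypothetical illustrations, not part of mlx-vlm, and the threshold would need tuning against real outputs.

```python
from collections import Counter


def repetition_ratio(text, n=2):
    """Fraction of word n-grams that repeat an earlier n-gram (0.0 = none)."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence after the first one counts as a repeat.
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


def dedupe_keywords(line):
    """Drop repeated comma-separated keywords, keeping first-seen order."""
    seen = set()
    kept = []
    for keyword in (part.strip() for part in line.split(",")):
        key = keyword.lower()
        if keyword and key not in seen:
            seen.add(key)
            kept.append(keyword)
    return kept


def looks_degenerate(text, threshold=0.5):
    """Assumed cut-off: more than half of the word pairs are repeats."""
    return repetition_ratio(text) > threshold
```

Output such as `Royal Terrace, Royal Terrace, ...` scores close to 1.0, while a normal one-sentence caption scores at or near 0.0. Long keyword lists that loop score somewhere in between, which is why the cut-off is only a starting point.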

% python check_models.py          
mlx version: 0.21.1.dev20250104+eab93985b
mlx-vlm version: 0.1.10
The most recently modified file is: /Users/x/Pictures/Processed/20250104-150453_DSC01821.jpg
Image dimensions: 8640x4860 (42.0 MPixels)
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running HuggingFaceTB/SmolVLM-Instruct
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 22143.27it/s]
Some kwargs in processor config are unused and will not have any effect: image_seq_len. 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 29782.04it/s]
 Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace,
Output generated in 5.91s
Memory used: 5.16 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running OpenGVLab/InternVL2_5-8B
Fetching 21 files: 100%|█████████████████████████████████████████████████████████████| 21/21 [00:00<00:00, 19569.07it/s]
ERROR:root:Model type internvl_chat not supported.
Failed to load model or config at OpenGVLab/InternVL2_5-8B: Model type internvl_chat not supported.
================================================================================

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running cognitivecomputations/dolphin-2.9.2-qwen2-72b
Fetching 40 files: 100%|██████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 7638.16it/s]
ERROR:root:Model type qwen2 not supported.
Failed to load model or config at cognitivecomputations/dolphin-2.9.2-qwen2-72b: Model type qwen2 not supported.
================================================================================

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running distilbert/distilbert-base-uncased-finetuned-sst-2-english
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 18987.34it/s]
ERROR:root:Model type distilbert not supported.
Failed to load model or config at distilbert/distilbert-base-uncased-finetuned-sst-2-english: Model type distilbert not supported.
================================================================================

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running google/siglip-so400m-patch14-384
Fetching 6 files: 100%|█████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 4827.51it/s]
ERROR:root:Model type siglip not supported.
Failed to load model or config at google/siglip-so400m-patch14-384: Model type siglip not supported.
================================================================================

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running meta-llama/Llama-3.2-11B-Vision-Instruct
Fetching 15 files: 100%|██████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 5462.28it/s]
Fetching 15 files: 100%|█████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 15992.52it/s]
The image depicts a large body of water with a docked boat in the foreground, surrounded by buildings and a cityscape in the background.

* A large body of water:
	+ The water is calm and peaceful
	+ It appears to be a river or a lake
	+ The water is grayish-blue in color
* A docked boat:
	+ The boat is white and blue in color
	+ It has a large cabin and a small deck
	+ There are several flags flying on the boat
* Buildings:
	+ There are several buildings along the waterfront
	+ They appear to be residential or commercial buildings
	+ They are made of brick or stone and have multiple stories
* A cityscape:
	+ The city is visible in the background
	+ It appears to be a small town or city
	+ There are trees and hills in the distance

The image suggests that the city is located near a body of water, possibly a river or a lake. The presence of a docked boat and buildings along the waterfront indicates that the city may have a strong maritime industry or a port.
Output generated in 66.07s
Memory used: 18.72 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running microsoft/Florence-2-large-ft
Fetching 9 files: 100%|████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 29104.65it/s]
ERROR:root:No safetensors found in /Users/jrp/.cache/huggingface/hub/models--microsoft--Florence-2-large-ft/snapshots/bb44b80c15e943b1bf7cec6e076359cec6e40178
Failed to load model or config at microsoft/Florence-2-large-ft: 
No safetensors found in /Users/jrp/.cache/huggingface/hub/models--microsoft--Florence-2-large-ft/snapshots/bb44b80c15e943b1bf7cec6e076359cec6e40178
Create safetensors using the following code:

from transformers import AutoModelForCausalLM, AutoProcessor

model_id= "<huggingface_model_id>"
model = AutoModelForCausalLM.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

model.save_pretrained("<local_dir>")
processor.save_pretrained("<local_dir>")

Then use the <local_dir> as the --hf-path in the convert script.

python -m mlx_vlm.convert --hf-path <local_dir> --mlx-path <mlx_dir>

        
================================================================================

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running microsoft/Phi-3.5-mini-instruct
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 21950.87it/s]
ERROR:root:Model type phi3 not supported.
Failed to load model or config at microsoft/Phi-3.5-mini-instruct: Model type phi3 not supported.
================================================================================

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running microsoft/Phi-3.5-vision-instruct
Fetching 14 files: 100%|█████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 39199.10it/s]
/opt/homebrew/Caskroom/miniconda/base/envs/mlx/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:524: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
Fetching 14 files: 100%|█████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 24016.46it/s]
A view of a harbor with a large building, a crane, and several boats, including a red and white boat and a black and white boat. The sky is overcast, and the water is calm.<|end|>
Output generated in 5.70s
Memory used: 7.47 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mistral-community/pixtral-12b
Fetching 15 files: 100%|█████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 46328.84it/s]
Failed to load model or config at mistral-community/pixtral-12b: Unsupported model type: pixtral
================================================================================

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Florence-2-large-ft-bf16
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 26024.64it/s]
Fetching 12 files: 100%|██████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 8338.58it/s]
<s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s>
Output generated in 3.68s
Memory used: 1.62 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Llama-3.2-11B-Vision-Instruct-8bit
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 11175.87it/s]
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 52626.15it/s]
The image depicts a serene waterfront scene, featuring a large body of water in the foreground and a row of buildings along the shore.

**Description:**
The image showcases a tranquil waterfront scene, with a large body of water occupying the foreground. The water's surface is calm, reflecting the surrounding environment. In the background, a row of buildings lines the shore, with a mix of white and brown structures, some of which appear to be residential or commercial properties. A few trees are visible behind the buildings, adding a touch of greenery to the scene.

**Keywords:**
waterfront, buildings, water, calm, serene, peaceful, harbor, dock, pier, boats, ships, vessels, nautical, maritime, transportation, travel, tourism, leisure, recreation, relaxation, scenic, natural, environment, nature, landscape, scenery, view, perspective, horizon, sky, clouds, weather, climate, season, time, day, night, morning, evening, sunset, sunrise, fog, mist, haze, atmosphere, mood, feeling, emotion, sentiment, tone, style, theme, genre, art, photography, composition, lighting, color, texture, pattern, shape, form, structure, architecture, design, style, trend, fashion, culture, society, community, people, activities, events, celebrations, festivals, holidays, traditions, customs, rituals, practices, beliefs, values, principles, ethics, morality, philosophy, religion, spirituality, faith, hope, love, kindness, compassion, empathy, understanding, tolerance, acceptance, inclusivity, diversity, equality, justice, fairness, freedom, human rights, dignity, respect, care, concern, support, help, assistance, resources, services, infrastructure, technology, innovation, progress, development, growth, improvement, success, achievement, recognition, reward, celebration, accomplishment, milestone, goal, target, plan, strategy, vision, mission, leadership, management, organization, team, collaboration, communication, cooperation, conflict, resolution, negotiation, mediation, arbitration, dispute, resolution, settlement, agreement, contract, partnership, alliance, cooperation, collaboration, competition, rivalry, challenge, opportunity, risk, threat, danger, crisis, emergency, disaster, response, recovery, resilience, adaptability, flexibility, innovation, creativity, problem-solving, critical thinking, decision-making, analysis, evaluation, research, development, innovation, entrepreneurship, business, economy, finance, investment, trade, commerce, industry, manufacturing
Output generated in 63.14s
Memory used: 10.78 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Llama-3.3-70B-Instruct-8bit
Fetching 20 files: 100%|█████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 19939.64it/s]
ERROR:root:Model type llama not supported.
Failed to load model or config at mlx-community/Llama-3.3-70B-Instruct-8bit: Model type llama not supported.
================================================================================

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Molmo-7B-D-0924-8bit
Fetching 16 files: 100%|█████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 13400.33it/s]
Fetching 16 files: 100%|█████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 18819.09it/s]
 A waterfront scene featuring a large white building with a brown roof, likely a hotel or apartment complex, and a smaller building with a blue roof. Several boats are docked in front, including a red and black tugboat and a blue and white ferry. The sky is overcast, and trees are visible in the background. A tall metal tower stands on the right side of the image.

Keywords: waterfront, boats, building, hotel, ferry, tugboat, trees, overcast, sky, tower
Output generated in 38.89s
Memory used: 9.40 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Molmo-7B-D-0924-bf16
Fetching 18 files: 100%|█████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 13233.56it/s]
Fetching 18 files: 100%|█████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 17928.63it/s]
 A picturesque waterfront scene featuring a large body of water with boats docked in the foreground, including a prominent orange and black vessel and a blue and white boat. Behind the waterfront, there's a mix of buildings, with a notable three-story white structure on the left and a larger brown building with a gray roof on the right. The background showcases a hillside dotted with houses and trees, under a gray sky. The image captures a serene urban landscape with a blend of natural and man-made elements.

Keywords: waterfront, boats, buildings, hillside, houses, trees, gray sky, urban landscape
Output generated in 40.39s
Memory used: 11.76 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Phi-3.5-vision-instruct-bf16
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 42498.79it/s]
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 14509.30it/s]
A view of a harbor with a large building, a crane, and several boats, including a red and white boat and a black and white boat. The sky is overcast, and the water is calm.<|end|>
Output generated in 5.67s
Memory used: 7.80 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/QVQ-72B-Preview-8bit
Fetching 25 files: 100%|██████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 7012.95it/s]
Fetching 25 files: 100%|██████████████████████████████████████████████████████████████| 25/25 [00:00<00:00, 9971.24it/s]
Failed to generate output for model at mlx-community/QVQ-72B-Preview-8bit: arange(): incompatible function arguments. The following argument types are supported:
    1. arange(start : Union[int, float], stop : Union[int, float], step : Union[None, int, float], dtype: Optional[Dtype] = None, *, stream: Union[None, Stream, Device] = None) -> array
    2. arange(stop : Union[int, float], step : Union[None, int, float] = None, dtype: Optional[Dtype] = None, *, stream: Union[None, Stream, Device] = None) -> array

Invoked with types: mlx.core.array, kwargs = { dtype: mlx.core.Dtype }
********************************************************************************

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/Qwen2-VL-7B-Instruct-8bit
Fetching 12 files: 100%|████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 249166.57it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 84449.07it/s]
Failed to generate output for model at mlx-community/Qwen2-VL-7B-Instruct-8bit: arange(): incompatible function arguments. The following argument types are supported:
    1. arange(start : Union[int, float], stop : Union[int, float], step : Union[None, int, float], dtype: Optional[Dtype] = None, *, stream: Union[None, Stream, Device] = None) -> array
    2. arange(stop : Union[int, float], step : Union[None, int, float] = None, dtype: Optional[Dtype] = None, *, stream: Union[None, Stream, Device] = None) -> array

Invoked with types: mlx.core.array, kwargs = { dtype: mlx.core.Dtype }
********************************************************************************

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/SmolVLM-Instruct-bf16
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 23674.34it/s]
Some kwargs in processor config are unused and will not have any effect: image_seq_len. 
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 22211.67it/s]
 Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace, Royal Terrace,
Output generated in 5.98s
Memory used: 1.13 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/deepseek-vl2-8bit
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 24126.53it/s]
Some kwargs in processor config are unused and will not have any effect: ignore_id, image_mean, add_special_token, normalize, downsample_ratio, candidate_resolutions, patch_size, image_token, sft_format, image_std, mask_prompt, pad_token. 
Add pad token = ['<|▁pad▁|>'] to the tokenizer
<|▁pad▁|>:2
Add image token = ['<image>'] to the tokenizer
<image>:128815
Added grounding-related tokens
Added chat tokens
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 22102.13it/s]
The image shows a waterfront scene with a mix of modern and older buildings. Prominently, there is a large white building with a clock tower and a sign that reads "Royal Terrace Hotel." In the foreground, there is a body of water with a small boat labeled "PILOTS" and "SURVEY" moored at a dock. The background features a hilly area with more buildings and a mix of trees and open space. The overall atmosphere is overcast, and the image has a calm, serene feel to it.
Output generated in 7.50s
Memory used: 27.36 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/dolphin-vision-72b-4bit
Fetching 19 files: 100%|█████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 37308.88it/s]
Fetching 19 files: 100%|█████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 10833.57it/s]
The image can be captioned as "A Quiet Harbor on a Foggy Day." The description would be: "A serene harbor scene with various boats docked, surrounded by residential buildings and a hill in the background, under a foggy sky." The keywords or tags for this image could include: "harbor," "boats," "residential buildings," "foggy day," "overcast sky," "waterfront," "peaceful," "nautical," "coastal living," "marine life," and "landscape."
Output generated in 26.58s
Memory used: 31.23 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/idefics2-8b-chatty-8bit
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 40787.40it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 24480.37it/s]
This image captures a serene harbor scene. The main focus is a row of white buildings, each adorned with a red roof, that line the waterfront. These buildings, with their distinctive color scheme, stand out against the backdrop of the harbor. 

In the foreground, several boats are docked, their presence adding a touch of life to the otherwise tranquil setting. The water of the harbor is a deep blue-gray color, reflecting the overcast sky above. 

The sky itself is filled with clouds, suggesting an overcast day. Despite this, the scene is far from gloomy. The calm water and the quiet harbor create a sense of peace and tranquility.

In terms of object count, there are multiple boats and buildings visible in the image. The boats are docked in the foreground, while the buildings line the waterfront.

The relative positions of the objects are such that the boats are closer to the viewer than the buildings. The buildings are arranged in a row along the waterfront, while the boats are scattered throughout the foreground.

This image is a beautiful representation of a quiet harbor scene, with its white buildings, red-roofed buildings, and docked boats. It's a snapshot of a moment of calm and tranquility in an otherwise bustling world.

Comma-separated keywords or tags:
harbor, boats, waterfront, buildings, overcast, sky, water, boats, docked, calm, tranquil, reflection, sky, clouds, peace, quiet, reflection, waterfront, buildings, white, red roof, docked, boats, harbor, moment, calm, tranquil, snapshot, world, bustling, quiet, harbor, boats, waterfront, buildings, white, red roof, docked, boats, harbor, moment, calm, tranquil, snapshot, world, bustling, quiet, harbor, boats, waterfront, buildings, white, red roof, docked, boats, harbor, moment, calm, tranquil, snapshot, world, bustling, quiet, harbor, boats, waterfront, buildings, white, red roof, docked, boats, harbor, moment, calm, tranquil, snapshot, world, bustling, quiet, harbor, boats,
Output generated in 11.97s
Memory used: 7.72 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/llava-v1.6-34b-8bit
Fetching 17 files: 100%|█████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 41771.04it/s]
Fetching 17 files: 100%|█████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 21221.18it/s]
Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.50.
Caption: A serene harbor scene with boats and buildings.

Description: The image captures a tranquil harbor scene with two boats docked at a pier. The boats, one red and one blue, are moored next to a white building with a gray roof. The building is situated on the left side of the image, while the boats are on the right. The calm water of the harbor reflects the clear blue sky above. In the distance, a hill dotted with trees can be seen, adding a touch of nature to the urban landscape.

Keywords: harbor, boats, buildings, pier, water, sky, hill, trees, tranquil, urban, nature, reflection, distance, clear, blue, red, white, gray, calm, moored, docked, serene, scene, image, peaceful, cityscape, architecture, landscape, natural, man-made, structures, vessels, dock, peaceful, serene, calm, tranquil, urban, nature, reflection, distance, clear, blue, red, white, gray, calm, moored, docked, serene, scene, image, peaceful, cityscape, architecture, landscape, natural, man-made, structures, vessels, dock, peaceful, serene, calm, tranquil, urban, nature, reflection, distance, clear, blue, red, white, gray, calm, moored, docked, serene, scene, image, peaceful, cityscape, architecture, landscape, natural, man-made, structures, vessels, dock, peaceful, serene, calm, tranquil, urban, nature, reflection, distance, clear, blue, red, white, gray, calm, moored, docked, serene, scene, image, peaceful, cityscape, architecture, landscape, natural, man-made, structures, vessels, dock, peaceful, serene, calm, tranquil, urban, nature, reflection, distance, clear, blue, red, white, gray, calm, moored, docked, serene, scene, image, peaceful, cityscape, architecture, landscape, natural, man-made, structures, vessels, dock, peaceful, serene, calm, tranquil, urban, nature, reflection, distance, clear, blue, red, white, gray, calm, moored, docked, serene, scene, image, peaceful, cityscape, architecture, landscape, natural, man
Output generated in 54.86s
Memory used: 34.41 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/llava-v1.6-mistral-7b-8bit
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 23215.70it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 38362.54it/s]
Factual Caption: A serene harbor scene with a mix of residential and commercial buildings.

Description: The image depicts a tranquil harbor with a variety of buildings, including residential houses and commercial structures. The water is calm, reflecting the overcast sky. A boat is docked at the pier, and there are a few people visible on the pier. The buildings are predominantly white with some brown roofs, and the overall atmosphere is peaceful and quiet.

Keywords or Tags: harbor, residential houses, commercial buildings, boat, pier, overcast sky, calm water, people, tranquil, serene, reflection, buildings, water, dock, pier, boat dock, harbor scene, residential area, commercial area, calm harbor, overcast sky, reflection on water, people on pier, boat in harbor, harbor scene, residential houses by water, commercial buildings by water, calm harbor by buildings, overcast sky by buildings, reflection on water by buildings, people on pier by buildings, boat in harbor by buildings, harbor scene by buildings, residential area by water, commercial area by water, calm harbor by buildings, overcast sky by buildings, reflection on water by buildings, people on pier by buildings, boat in harbor by buildings, harbor scene by buildings, residential houses by water, commercial buildings by water, calm harbor by buildings, overcast sky by buildings, reflection on water by buildings, people on pier by buildings, boat in harbor by buildings, harbor scene by buildings, residential area by water, commercial area by water, calm harbor by buildings, overcast sky by buildings, reflection on water by buildings, people on pier by buildings, boat in harbor by buildings, harbor scene by buildings, residential houses by water, commercial buildings by water, calm harbor by buildings, overcast sky by buildings, reflection on water by buildings, people on pier by buildings, boat in harbor by buildings, harbor scene by buildings, residential area by water, commercial area by water, calm harbor by buildings, overcast sky by buildings, reflection on water by buildings, people on pier by buildings, boat in harbor by buildings, harbor scene by buildings, residential houses by water, commercial buildings by water, calm harbor by buildings, overcast sky by buildings, reflection on water by buildings, people on pier by buildings,
Output generated in 12.22s
Memory used: 6.15 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-10b-ft-docci-448-6bit
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 16594.67it/s]
Fetching 8 files: 100%|███████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 111476.52it/s]
A long shot view of a body of water with a large building on the right side of the water and a smaller building to the left of the water. A blue and white boat is docked in front of the smaller building. A large hill is in the background behind the buildings. A large number of trees are on the hill. A large number of houses are in front of the hill. A large number of windows are on the buildings. A large number of trees are in front of the houses. A large number of windows are on the houses. A large number of trees are in front of the windows. A large number of trees are in front of the houses. A large number of windows are on the houses. A large number of trees are in front of the windows. A large number of trees are in front of the houses. A large number of windows are on the houses. A large number of trees are in front of the windows. A large number of trees are in front of the houses. A large number of windows are on the houses. A large number of trees are in front of the windows. A large number of trees are in front of the houses. A large number of windows are on the houses. A large number of trees are in front of the windows. A large number of trees are in front of the houses. A large number of windows are on the houses. A large number of trees are in front of the windows. A large number of trees are in front of the houses. A large number of windows are on the houses. A large number of trees are in front of the windows. A large number of trees are in front of the houses. A large number of windows are on the houses. A large number of trees are in front of the windows. A large number of trees are in front of the houses. A large number of windows are on the houses. A large number of trees are in front of the windows. A large number of trees are in front of the houses. A large number of windows are on the houses. A large number of trees are in front of the windows. A large number of trees are in front of the houses. A large number of windows are on the houses. 
A large number of trees are in front of the windows. A large number of trees are in front of the houses. A large number of windows are on the houses. A large number of trees are in front of the windows. A large number of
Output generated in 15.53s
Memory used: 7.47 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-10b-ft-docci-448-bf16
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 29330.80it/s]
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 25010.76it/s]
A long shot view of a body of water with a large white building on the right side of the water and a smaller building to the left of it. A blue and white boat is docked in front of the white building. A small red boat is to the left of the blue boat. A hill is in the background behind the buildings. The sky is gray.
Output generated in 18.53s
Memory used: 18.01 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-3b-ft-docci-448-bf16
Fetching 8 files: 100%|███████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 190650.18it/s]
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 16296.47it/s]
A long shot view of a body of water with a large white building on the right side of the water, the building has a large archway at the top of it. To the left of the building is a smaller white building, and to the left of the smaller building is a large blue and white boat. Behind the buildings is a large hill that is made up of trees, the sky is a light gray color and is full of clouds.
Output generated in 6.82s
Memory used: 5.37 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/paligemma2-3b-pt-896-4bit
Fetching 7 files: 100%|████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 37259.05it/s]
Fetching 7 files: 100%|████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 17079.77it/s]
pilot
Output generated in 3.69s
Memory used: 1.61 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Running mlx-community/pixtral-12b-8bit
Fetching 11 files: 100%|█████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 15645.08it/s]
Fetching 11 files: 100%|█████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 18396.07it/s]
**Caption:** A serene waterfront scene featuring a docked boat and industrial buildings against a backdrop of hills.

**Description:** The image depicts a calm waterfront with a large boat docked near a row of industrial buildings. The buildings have a mix of architectural styles, with some featuring modern designs and others showcasing more traditional elements. In the background, a hillside dotted with houses and trees stretches across the horizon, under a cloudy sky. The overall atmosphere is peaceful and slightly overcast.

**Keywords:** Waterfront, boat, dock, industrial buildings, hills, houses, trees, cloudy sky, peaceful, overcast, water, calm.
Output generated in 11.38s
Memory used: 12.75 GB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
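
For keeping track of these runs over time, the log format above (a `Running <model>` line, the model's text, then `Output generated in …s` and `Memory used: … GB`) can be reduced to one record per model. The following is a minimal sketch, not part of mlx-vlm: `parse_log` and `repetition_ratio` are hypothetical helper names, and the regular expressions assume the exact line shapes seen in this log. The repetition score is a rough heuristic (the share of 4-word sequences that repeat an earlier one) for flagging outputs that fall into a loop, such as the keyword lists and "A large number of trees" runs above.

```python
import re
from collections import Counter

# Line shapes assumed from the log above; adjust if the script's output changes.
RUN_RE = re.compile(r"^Running (\S+)$")
TIME_RE = re.compile(r"^Output generated in ([\d.]+)s$")
MEM_RE = re.compile(r"^Memory used: ([\d.]+) GB$")


def repetition_ratio(text, n=4):
    """Fraction of word n-grams that repeat an earlier n-gram (0.0 = no loops)."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(grams)


def parse_log(log):
    """Return one dict per 'Running <model>' block found in the log text."""
    records, current = [], None
    for raw in log.splitlines():
        line = raw.strip()
        m = RUN_RE.match(line)
        if m:
            current = {"model": m.group(1), "output": []}
            records.append(current)
            continue
        if current is None:
            continue
        # Skip blank lines, download progress bars and the ^^^ / vvv separators.
        if not line or line.startswith("Fetching ") or set(line) <= {"^", "v"}:
            continue
        m = TIME_RE.match(line)
        if m:
            current["seconds"] = float(m.group(1))
            continue
        m = MEM_RE.match(line)
        if m:
            current["memory_gb"] = float(m.group(1))
            current = None  # block finished
            continue
        current["output"].append(line)
    for rec in records:
        rec["output"] = "\n".join(rec["output"])
        rec["repetition"] = repetition_ratio(rec["output"])
    return records


if __name__ == "__main__":
    import sys

    # Usage: python summarise_log.py < run.log
    for rec in parse_log(sys.stdin.read()):
        print(rec["model"], rec.get("seconds"), rec.get("memory_gb"),
              round(rec["repetition"], 2))
```

A high repetition score does not prove that an output is bad, and the threshold for "looping" would need tuning against real runs; the score is only meant to make degenerate outputs easy to spot in a summary table.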
