
The inference output is meaningless #17

Open
yydxlv opened this issue Dec 23, 2024 · 15 comments
yydxlv commented Dec 23, 2024

When I input an image, the inference output is repetitive and not well organized.

conversation = [
    {
        "role": "<|User|>",
        "content": "\n Describe the image.",
        "images": ["../data/cat.png"],
    },
    {"role": "<|Assistant|>", "content": ""},
]

pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt="Describe the image."
).to(vl_gpt.device)
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

outputs = vl_gpt.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=128,
    do_sample=False,
    use_cache=True
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)

The output is as follows:
Describe the image.

<|User|>:
Describe the image.

<|Assistant|>: The image shows a cat wearing a shirt with a shirt with a design that resembles a shirt with a design that resembles a shirt with a design that resembles a shirt with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with


HubHop commented Dec 23, 2024

Hi @yydxlv, can you let me know which model you are testing? If possible, can you also share this failure case (image) with us?


yydxlv commented Dec 23, 2024

> Hi @yydxlv, can you let me know which model you are testing? If possible, can you also share this failure case (image) with us?

The model name is deepseek-vl2-small.

I use the following code to load the model:

import torch
from transformers import AutoModelForCausalLM
from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl2.utils.io import load_pil_images
from transformers import AutoProcessor

model_name = "models/deepseek-vl2-small"

vl_chat_processor = DeepseekVLV2Processor.from_pretrained(model_name)
tokenizer = vl_chat_processor.tokenizer
vl_gpt = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(torch.bfloat16).cuda().eval()

The input image is attached: [cat.png]

Thank you for your help.


HubHop commented Dec 23, 2024

Hi @yydxlv, I tested it with DeepSeek-VL2-Small and your image. The results seem to be good:

<|User|>: <image>
 Describe the image.

<|Assistant|>: The image shows a cat wearing a shirt with various technical diagrams and text related to machine learning and neural networks. The shirt has words like "TRANSFORMER," "ATTENTION MECHANISM," "ENCODER," and "DECODER." The cat is sitting on a rug in a room with a wooden chair and a bookshelf in the background.<|end▁of▁sentence|>

Please try our demo code in the README and directly replace the prompts/images, to see whether the results are correct.

conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\n Describe the image.",
        "images": ["./images/cat.png"],
    },
    {"role": "<|Assistant|>", "content": ""},
]


# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt=""
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=128,
    do_sample=False,
    use_cache=True
)
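
For completeness, a sketch of the decode step to print the answer, assuming the outputs, tokenizer, and prepare_inputs variables above (it is the same as in your snippet earlier in this thread):

# decode the generated tokens and print the formatted prompt plus the answer
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)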


yydxlv commented Dec 23, 2024

Yeah, I used the same code as the demo, but the result is strange.

[screenshots of the code and the repetitive output attached]

Is there anything wrong with my setup?


HubHop commented Dec 23, 2024

The code looks good to me. I'm not entirely sure, but you may check whether it is due to different versions of Python packages. Below is our testing environment.

torch==2.0.1
transformers==4.38.2
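
If it helps, a minimal sketch to confirm which versions the running interpreter actually picks up:

# print the versions seen by the Python process that runs the model
import torch
import transformers
print(torch.__version__)        # expected: 2.0.1
print(transformers.__version__) # expected: 4.38.2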


yydxlv commented Dec 23, 2024

> The code looks good to me. I'm not entirely sure, but you may check whether it is due to different versions of Python packages. Below is our testing environment.
>
> torch==2.0.1
> transformers==4.38.2

Yeah, I have just reinstalled and verified torch==2.0.1 and transformers==4.38.2. The model is run on an A800, but the output is still the same:

"
<|User|>:
Describe the image.

<|Assistant|>: The image shows a cat wearing a shirt with a shirt with a design that resembles a transformer. The shirt has a transformer design on it. The shirt has a transformer. The shirt has a transformer. The shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt
"


Blaizzy commented Dec 23, 2024

You need to pass input_ids.

There is a PR addressing this issue: #11
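
A rough sketch of what that change might look like, assuming the prepare_inputs, inputs_embeds, vl_gpt, and tokenizer from the snippets above (not the exact diff from the PR; the interface mirrors the repo's inference.py):

# call generate on the wrapper model, passing input_ids and the other
# processor outputs alongside inputs_embeds
outputs = vl_gpt.generate(
    inputs_embeds=inputs_embeds,
    input_ids=prepare_inputs.input_ids,
    images=prepare_inputs.images,
    images_seq_mask=prepare_inputs.images_seq_mask,
    images_spatial_crop=prepare_inputs.images_spatial_crop,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=128,
    do_sample=False,
    use_cache=True,
)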


MrRace commented Dec 24, 2024

It was truly disappointing. Nothing is perfect everywhere. Maybe it's time to let go!

tracyCzf commented

@yydxlv @HubHop Using the same code, deepseek-vl2-small generated the same bad result as yydxlv, but deepseek-vl2-tiny and deepseek-vl2 generated results that look good.


p1x33l commented Jan 8, 2025

+1


Blaizzy commented Jan 8, 2025

I can't get deepseek tiny to work well on MLX


p1x33l commented Jan 8, 2025

I had the same issue. I extracted the code used in the inference.py file and it works fine:

from typing import List, Dict
import torch
from transformers import AutoModelForCausalLM
import PIL.Image
from deepseek_vl2.models import DeepseekVLV2ForCausalLM, DeepseekVLV2Processor
from deepseek_vl2.serve.app_modules.utils import parse_ref_bbox


def load_pil_images(conversations: List[Dict[str, str]]) -> List[PIL.Image.Image]:
    pil_images = []
    for message in conversations:
        if "images" not in message:
            continue

        for image_path in message["images"]:
            pil_img = PIL.Image.open(image_path)
            pil_img = pil_img.convert("RGB")
            pil_images.append(pil_img)
    return pil_images


def main(conversation, model_path="deepseek-ai/deepseek-vl2-tiny", chunk_size=512):
    
    dtype = torch.bfloat16

    vl_chat_processor: DeepseekVLV2Processor = DeepseekVLV2Processor.from_pretrained(model_path)
    tokenizer = vl_chat_processor.tokenizer

    vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(
        model_path,
        trust_remote_code=True,
        torch_dtype=dtype
    )
    vl_gpt = vl_gpt.cuda().eval()

    # load images and prepare for inputs
    pil_images = load_pil_images(conversation)

    prepare_inputs = vl_chat_processor.__call__(
        conversations=conversation,
        images=pil_images,
        force_batchify=True,
        system_prompt=""
    ).to(vl_gpt.device, dtype=dtype)

    with torch.no_grad():

        if chunk_size == -1:
            inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
            past_key_values = None
        else:
            # incremental_prefilling when using 40G GPU for vl2-small
            inputs_embeds, past_key_values = vl_gpt.incremental_prefilling(
                input_ids=prepare_inputs.input_ids,
                images=prepare_inputs.images,
                images_seq_mask=prepare_inputs.images_seq_mask,
                images_spatial_crop=prepare_inputs.images_spatial_crop,
                attention_mask=prepare_inputs.attention_mask,
                chunk_size=chunk_size
            )

        # run the model to get the response
        outputs = vl_gpt.generate(
            inputs_embeds=inputs_embeds,
            input_ids=prepare_inputs.input_ids,
            images=prepare_inputs.images,
            images_seq_mask=prepare_inputs.images_seq_mask,
            images_spatial_crop=prepare_inputs.images_spatial_crop,
            attention_mask=prepare_inputs.attention_mask,
            past_key_values=past_key_values,

            pad_token_id=tokenizer.eos_token_id,
            bos_token_id=tokenizer.bos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            max_new_tokens=512,

            do_sample=True,
            temperature=0.4,
            top_p=0.9,
            repetition_penalty=1.1,

            use_cache=True,
        )

        answer = tokenizer.decode(outputs[0][len(prepare_inputs.input_ids[0]):].cpu().tolist(), skip_special_tokens=False)
        print(f"{prepare_inputs['sft_format'][0]}", answer)

        vg_image = parse_ref_bbox(answer, image=pil_images[-1])
        if vg_image is not None:
            vg_image.save("./vg.jpg", format="JPEG", quality=85)

conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\n<|grounding|>Describe the image.",
        "images": [
            "cat.png",
        ],
    },
    {"role": "<|Assistant|>", "content": ""},
]

main(conversation)
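
For the failing case earlier in this thread, the same script can presumably be pointed at the small checkpoint as well, e.g.:

# hypothetical invocation for the small checkpoint discussed above;
# chunk_size=512 keeps the incremental prefilling path, which the comment
# in the script suggests for vl2-small on a 40G GPU
main(conversation, model_path="deepseek-ai/deepseek-vl2-small", chunk_size=512)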


Blaizzy commented Jan 8, 2025

Is this for me @p1x33l?


Blaizzy commented Jan 8, 2025

Testing prince-canuma--deepseek-vl2-tiny ───────────────────────────────────────────────

Some kwargs in processor config are unused and will not have any effect: add_special_token, image_token, pad_token, image_mean, sft_format, normalize, ignore_id, candidate_resolutions, image_std, mask_prompt, patch_size, downsample_ratio. 
Add pad token = ['<|▁pad▁|>'] to the tokenizer
<|▁pad▁|>:2
Add image token = ['<image>'] to the tokenizer
<image>:128815
Added grounding-related tokens
Added chat tokens
✓ Model loaded successfully in 8.29 seconds


Testing vision-language generation...
==========
Image: ['visual_grounding.jpeg'] 

Prompt: <|User|>: <image>
Describe this image.

<|Assistant|>:
 Snikas in the end
 ==========
 ✓ vision-language generation successful
Testing language-only generation...
==========
Image: None 

Prompt: <|User|>: Hi, how are you?

<|Assistant|>:
I'm DeepSeek-VL, an intelligent assistant, I can recognize images and provide services such as a virtual assistant, I am DeepSeek-VL, an intelligent assistant developed by DeepSeek-5433Dentirelindisthephat one, a question and I am DeepSeekTextual.
==========
✓ language-only generation successful


Blaizzy commented Jan 8, 2025

Compared to deepseek-small and deepseek-vl2.

Testing mlx-community/deepseek-vl2-6bit ────────────────────────────────────────────────╯

Loading model...
Some kwargs in processor config are unused and will not have any effect: add_special_token, image_token, pad_token, image_mean, sft_format, normalize, ignore_id, candidate_resolutions, image_std, mask_prompt, patch_size, downsample_ratio. 
Add pad token = ['<|▁pad▁|>'] to the tokenizer
<|▁pad▁|>:2
Add image token = ['<image>'] to the tokenizer
<image>:128815
Added grounding-related tokens
Added chat tokens
✓ Model loaded successfully in 24.59 seconds


Testing vision-language generation...
==========
Image: ['visual_grounding.jpeg'] 

Prompt: <|User|>: <image>
Describe this image.

<|Assistant|>:
Two giraffes standing on what appears to be an open grassy plain or savannah-like environment during daylight hours. The foreground shows a taller giraffe facing leftward towards the camera; it has long necks adorned with distinctive brown patches separated by lighter lines of fur that cover most of its body except for white legs ending in black hooves. Its head is turned slightly to reveal both ears perked up attentively. Behind the first giraffe stands another similar individual facing rightward but looking over its shoulder
==========
Prompt: 642 tokens, 196.790 tokens-per-sec
Generation: 100 tokens, 41.571 tokens-per-sec
Peak memory: 23.265 GB
✓ vision-language generation successful
Testing language-only generation...
==========
Image: None 

Prompt: <|User|>: Hi, how are you?

<|Assistant|>:
Hello! I'm just a digital AI assistant, so I don't have feelings or the ability to 'be there,' but I'm here to help you with anything you need! How can I assist you today?
==========
Prompt: 11 tokens, 115.140 tokens-per-sec
Generation: 44 tokens, 53.325 tokens-per-sec
Peak memory: 22.945 GB
✓ language-only generation successful
