
The inference output is meaningless #17

Open
yydxlv opened this issue Dec 23, 2024 · 15 comments
yydxlv commented Dec 23, 2024

When I input an image, the inference output is repetitive and not well organized.

conversation = [
    {
        "role": "<|User|>",
        "content": "\n Describe the image.",
        "images": ["../data/cat.png"],
    },
    {"role": "<|Assistant|>", "content": ""},
]

pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt="Describe the image."
).to(vl_gpt.device)
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

outputs = vl_gpt.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=128,
    do_sample=False,
    use_cache=True
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)

The output is as follows:
Describe the image.

<|User|>:
Describe the image.

<|Assistant|>: The image shows a cat wearing a shirt with a shirt with a design that resembles a shirt with a design that resembles a shirt with a design that resembles a shirt with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with


HubHop commented Dec 23, 2024

Hi @yydxlv, can you let me know which model you are testing? If possible, can you also share this failure case (image) with us?


yydxlv commented Dec 23, 2024

> Hi @yydxlv, can you let me know which model you are testing? If possible, can you also share this failure case (image) with us?

The model name is deepseek-vl2-small.

I use the following code to load the model:

import torch
from transformers import AutoModelForCausalLM
from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl2.utils.io import load_pil_images
from transformers import AutoProcessor

model_name = "models/deepseek-vl2-small"

vl_chat_processor = DeepseekVLV2Processor.from_pretrained(model_name)
tokenizer = vl_chat_processor.tokenizer
vl_gpt = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(torch.bfloat16).cuda().eval()

The input image is attached: [cat.png]

Thank you for your help.


HubHop commented Dec 23, 2024

Hi @yydxlv, I tested it with DeepSeek-VL2-Small and your image. The results seem to be good:

<|User|>: <image>
 Describe the image.

<|Assistant|>: The image shows a cat wearing a shirt with various technical diagrams and text related to machine learning and neural networks. The shirt has words like "TRANSFORMER," "ATTENTION MECHANISM," "ENCODER," and "DECODER." The cat is sitting on a rug in a room with a wooden chair and a bookshelf in the background.<|end▁of▁sentence|>

Please try our demo code in the README and directly replace the prompts/images, to see whether the results are correct.

conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\n Describe the image.",
        "images": ["./images/cat.png"],
    },
    {"role": "<|Assistant|>", "content": ""},
]


# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt=""
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=128,
    do_sample=False,
    use_cache=True
)
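
For completeness, a sketch of the decode step to print the answer, assuming the outputs, tokenizer, and prepare_inputs variables above (it is the same as in your snippet earlier in this thread):

# decode the generated tokens and print the formatted prompt plus the answer
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)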


yydxlv commented Dec 23, 2024

Yeah, I used the same code as the demo, but the result is strange.

[screenshots of the code and the repetitive output attached]

Is there anything wrong with my setup?


HubHop commented Dec 23, 2024

The code looks good to me. I'm not entirely sure, but you may check whether it is due to different versions of Python packages. Below is our testing environment.

torch==2.0.1
transformers==4.38.2
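
If it helps, a minimal sketch to confirm which versions the running interpreter actually picks up:

# print the versions seen by the Python process that runs the model
import torch
import transformers
print(torch.__version__)        # expected: 2.0.1
print(transformers.__version__) # expected: 4.38.2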


yydxlv commented Dec 23, 2024

> The code looks good to me. I'm not entirely sure, but you may check whether it is due to different versions of Python packages. Below is our testing environment.
>
> torch==2.0.1
> transformers==4.38.2

Yeah, I have just reinstalled and verified torch==2.0.1 and transformers==4.38.2. The model is run on an A800, but the output is still the same:

"
<|User|>:
Describe the image.

<|Assistant|>: The image shows a cat wearing a shirt with a shirt with a design that resembles a transformer. The shirt has a transformer design on it. The shirt has a transformer. The shirt has a transformer. The shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt shirt
"


Blaizzy commented Dec 23, 2024

You need to pass input_ids.

There is a PR addressing this issue: #11
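
A rough sketch of what that change might look like, assuming the prepare_inputs, inputs_embeds, vl_gpt, and tokenizer from the snippets above (not the exact diff from the PR; the interface mirrors the repo's inference.py):

# call generate on the wrapper model, passing input_ids and the other
# processor outputs alongside inputs_embeds
outputs = vl_gpt.generate(
    inputs_embeds=inputs_embeds,
    input_ids=prepare_inputs.input_ids,
    images=prepare_inputs.images,
    images_seq_mask=prepare_inputs.images_seq_mask,
    images_spatial_crop=prepare_inputs.images_spatial_crop,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=128,
    do_sample=False,
    use_cache=True,
)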


MrRace commented Dec 24, 2024

It was truly disappointing. Nothing is perfect everywhere. Maybe it's time to let go!

tracyCzf commented

@yydxlv @HubHop Using the same code, deepseek-vl2-small generated the same bad result as yydxlv, but deepseek-vl2-tiny and deepseek-vl2 generated results that look good.


p1x33l commented Jan 8, 2025

+1


Blaizzy commented Jan 8, 2025

I can't get deepseek tiny to work well on MLX


p1x33l commented Jan 8, 2025

I had the same issue. I extracted the code used in the inference.py file and it works fine:

from typing import List, Dict
import torch
from transformers import AutoModelForCausalLM
import PIL.Image
from deepseek_vl2.models import DeepseekVLV2ForCausalLM, DeepseekVLV2Processor
from deepseek_vl2.serve.app_modules.utils import parse_ref_bbox


def load_pil_images(conversations: List[Dict[str, str]]) -> List[PIL.Image.Image]:
    pil_images = []
    for message in conversations:
        if "images" not in message:
            continue

        for image_path in message["images"]:
            pil_img = PIL.Image.open(image_path)
            pil_img = pil_img.convert("RGB")
            pil_images.append(pil_img)
    return pil_images


def main(conversation, model_path="deepseek-ai/deepseek-vl2-tiny", chunk_size=512):
    
    dtype = torch.bfloat16

    vl_chat_processor: DeepseekVLV2Processor = DeepseekVLV2Processor.from_pretrained(model_path)
    tokenizer = vl_chat_processor.tokenizer

    vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(
        model_path,
        trust_remote_code=True,
        torch_dtype=dtype
    )
    vl_gpt = vl_gpt.cuda().eval()

    # load images and prepare for inputs
    pil_images = load_pil_images(conversation)

    prepare_inputs = vl_chat_processor.__call__(
        conversations=conversation,
        images=pil_images,
        force_batchify=True,
        system_prompt=""
    ).to(vl_gpt.device, dtype=dtype)

    with torch.no_grad():

        if chunk_size == -1:
            inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
            past_key_values = None
        else:
            # incremental_prefilling when using 40G GPU for vl2-small
            inputs_embeds, past_key_values = vl_gpt.incremental_prefilling(
                input_ids=prepare_inputs.input_ids,
                images=prepare_inputs.images,
                images_seq_mask=prepare_inputs.images_seq_mask,
                images_spatial_crop=prepare_inputs.images_spatial_crop,
                attention_mask=prepare_inputs.attention_mask,
                chunk_size=chunk_size
            )

        # run the model to get the response
        outputs = vl_gpt.generate(
            inputs_embeds=inputs_embeds,
            input_ids=prepare_inputs.input_ids,
            images=prepare_inputs.images,
            images_seq_mask=prepare_inputs.images_seq_mask,
            images_spatial_crop=prepare_inputs.images_spatial_crop,
            attention_mask=prepare_inputs.attention_mask,
            past_key_values=past_key_values,

            pad_token_id=tokenizer.eos_token_id,
            bos_token_id=tokenizer.bos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            max_new_tokens=512,

            do_sample=True,
            temperature=0.4,
            top_p=0.9,
            repetition_penalty=1.1,

            use_cache=True,
        )

        answer = tokenizer.decode(outputs[0][len(prepare_inputs.input_ids[0]):].cpu().tolist(), skip_special_tokens=False)
        print(f"{prepare_inputs['sft_format'][0]}", answer)

        vg_image = parse_ref_bbox(answer, image=pil_images[-1])
        if vg_image is not None:
            vg_image.save("./vg.jpg", format="JPEG", quality=85)

conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\n<|grounding|>Describe the image.",
        "images": [
            "cat.png",
        ],
    },
    {"role": "<|Assistant|>", "content": ""},
]

main(conversation)
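
For the failing case earlier in this thread, the same script can presumably be pointed at the small checkpoint as well, e.g.:

# hypothetical invocation for the small checkpoint discussed above;
# chunk_size=512 keeps the incremental prefilling path, which the comment
# in the script suggests for vl2-small on a 40G GPU
main(conversation, model_path="deepseek-ai/deepseek-vl2-small", chunk_size=512)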


Blaizzy commented Jan 8, 2025

Is this for me @p1x33l?


Blaizzy commented Jan 8, 2025

Testing prince-canuma--deepseek-vl2-tiny ───────────────────────────────────────────────

Some kwargs in processor config are unused and will not have any effect: add_special_token, image_token, pad_token, image_mean, sft_format, normalize, ignore_id, candidate_resolutions, image_std, mask_prompt, patch_size, downsample_ratio. 
Add pad token = ['<|▁pad▁|>'] to the tokenizer
<|▁pad▁|>:2
Add image token = ['<image>'] to the tokenizer
<image>:128815
Added grounding-related tokens
Added chat tokens
✓ Model loaded successfully in 8.29 seconds


Testing vision-language generation...
==========
Image: ['visual_grounding.jpeg'] 

Prompt: <|User|>: <image>
Describe this image.

<|Assistant|>:
 Snikas in the end
 ==========
 ✓ vision-language generation successful
Testing language-only generation...
==========
Image: None 

Prompt: <|User|>: Hi, how are you?

<|Assistant|>:
I'm DeepSeek-VL, an intelligent assistant, I can recognize images and provide services such as a virtual assistant, I am DeepSeek-VL, an intelligent assistant developed by DeepSeek-5433Dentirelindisthephat one, a question and I am DeepSeekTextual.
==========
✓ language-only generation successful


Blaizzy commented Jan 8, 2025

Compared to deepseek-small and deepseek-vl2.

Testing mlx-community/deepseek-vl2-6bit ────────────────────────────────────────────────╯

Loading model...
Some kwargs in processor config are unused and will not have any effect: add_special_token, image_token, pad_token, image_mean, sft_format, normalize, ignore_id, candidate_resolutions, image_std, mask_prompt, patch_size, downsample_ratio. 
Add pad token = ['<|▁pad▁|>'] to the tokenizer
<|▁pad▁|>:2
Add image token = ['<image>'] to the tokenizer
<image>:128815
Added grounding-related tokens
Added chat tokens
✓ Model loaded successfully in 24.59 seconds


Testing vision-language generation...
==========
Image: ['visual_grounding.jpeg'] 

Prompt: <|User|>: <image>
Describe this image.

<|Assistant|>:
Two giraffes standing on what appears to be an open grassy plain or savannah-like environment during daylight hours. The foreground shows a taller giraffe facing leftward towards the camera; it has long necks adorned with distinctive brown patches separated by lighter lines of fur that cover most of its body except for white legs ending in black hooves. Its head is turned slightly to reveal both ears perked up attentively. Behind the first giraffe stands another similar individual facing rightward but looking over its shoulder
==========
Prompt: 642 tokens, 196.790 tokens-per-sec
Generation: 100 tokens, 41.571 tokens-per-sec
Peak memory: 23.265 GB
✓ vision-language generation successful
Testing language-only generation...
==========
Image: None 

Prompt: <|User|>: Hi, how are you?

<|Assistant|>:
Hello! I'm just a digital AI assistant, so I don't have feelings or the ability to 'be there,' but I'm here to help you with anything you need! How can I assist you today?
==========
Prompt: 11 tokens, 115.140 tokens-per-sec
Generation: 44 tokens, 53.325 tokens-per-sec
Peak memory: 22.945 GB
✓ language-only generation successful
