The inference output is meaningless #17
Hi @yydxlv, can you let me know which model you are testing? If possible, can you also share this failure case (image) with us?

The model name is deepseek-vl2-small. I use the following code to load the model:

import torch
model_name = "models/deepseek-vl2-small"
vl_chat_processor = DeepseekVLV2Processor.from_pretrained(model_name)

The input image is the one attached. Thank you for the help.

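For context, loading only the processor is not enough to run generation. Below is a minimal sketch of a full load, mirroring the inference.py extraction further down this thread; the variable names are assumptions, not taken from the original comment.

import torch
from transformers import AutoModelForCausalLM
from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM

model_name = "models/deepseek-vl2-small"  # local path used in the comment above

# processor and its tokenizer
vl_chat_processor = DeepseekVLV2Processor.from_pretrained(model_name)
tokenizer = vl_chat_processor.tokenizer

# the VL model itself; bfloat16 on GPU, as in the later comments in this thread
vl_gpt = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
vl_gpt = vl_gpt.cuda().eval()
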
Hi @yydxlv, I tested it with DeepSeek-VL2-Small and your image. The results seem to be good:

<|User|>: <image>
Describe the image.

<|Assistant|>: The image shows a cat wearing a shirt with various technical diagrams and text related to machine learning and neural networks. The shirt has words like "TRANSFORMER," "ATTENTION MECHANISM," "ENCODER," and "DECODER." The cat is sitting on a rug in a room with a wooden chair and a bookshelf in the background.<|end▁of▁sentence|>

Please try our demo code in the README and directly replace the prompts/images to see if the results are correct or not.

conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\n Describe the image.",
        "images": ["./images/cat.png"],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt=""
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=128,
    do_sample=False,
    use_cache=True
)

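The snippet above stops at generate; decoding follows the same pattern as the later comments in this thread. A minimal sketch, assuming the same variable names:

# decode the generated tokens (the prompt is not included when only inputs_embeds are passed)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
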
The code looks good to me. I'm not entirely sure, but you may check whether it is due to different versions of the Python packages. Below is our testing environment.

[environment listing attached as an image]

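To compare a local setup against that environment, one option is to print the installed versions of the packages most likely to matter; a minimal sketch (the package list here is a guess, not the maintainers' list):

import importlib.metadata as metadata

# print versions of packages that commonly affect inference behaviour
for pkg in ("torch", "transformers", "accelerate", "xformers"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
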
You need to pass input_ids. There is a PR addressing this issue: #11

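A minimal sketch of what that change could look like against the demo snippet above, reusing its variable names and mirroring the extracted inference.py further down this thread (whether this matches PR #11 exactly is an assumption):

# call the model's generate with input_ids and the image tensors,
# not just the precomputed inputs_embeds
outputs = vl_gpt.generate(
    inputs_embeds=inputs_embeds,
    input_ids=prepare_inputs.input_ids,
    images=prepare_inputs.images,
    images_seq_mask=prepare_inputs.images_seq_mask,
    images_spatial_crop=prepare_inputs.images_spatial_crop,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=128,
    do_sample=False,
    use_cache=True,
)
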
It was truly disappointing. Nothing is perfect everywhere. Maybe it's time to let go!

+1

I can't get deepseek tiny to work well on MLX

I had the same issue. I extracted the code used in the inference.py file and it works fine:

from typing import List, Dict
import torch
from transformers import AutoModelForCausalLM
import PIL.Image
from deepseek_vl2.models import DeepseekVLV2ForCausalLM, DeepseekVLV2Processor
from deepseek_vl2.serve.app_modules.utils import parse_ref_bbox
def load_pil_images(conversations: List[Dict[str, str]]) -> List[PIL.Image.Image]:
    pil_images = []
    for message in conversations:
        if "images" not in message:
            continue
        for image_path in message["images"]:
            pil_img = PIL.Image.open(image_path)
            pil_img = pil_img.convert("RGB")
            pil_images.append(pil_img)
    return pil_images


def main(conversation, model_path="deepseek-ai/deepseek-vl2-tiny", chunk_size=512):
    dtype = torch.bfloat16

    vl_chat_processor: DeepseekVLV2Processor = DeepseekVLV2Processor.from_pretrained(model_path)
    tokenizer = vl_chat_processor.tokenizer

    vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(
        model_path,
        trust_remote_code=True,
        torch_dtype=dtype
    )
    vl_gpt = vl_gpt.cuda().eval()

    # load images and prepare for inputs
    pil_images = load_pil_images(conversation)
    prepare_inputs = vl_chat_processor.__call__(
        conversations=conversation,
        images=pil_images,
        force_batchify=True,
        system_prompt=""
    ).to(vl_gpt.device, dtype=dtype)

    with torch.no_grad():
        if chunk_size == -1:
            inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
            past_key_values = None
        else:
            # incremental_prefilling when using 40G GPU for vl2-small
            inputs_embeds, past_key_values = vl_gpt.incremental_prefilling(
                input_ids=prepare_inputs.input_ids,
                images=prepare_inputs.images,
                images_seq_mask=prepare_inputs.images_seq_mask,
                images_spatial_crop=prepare_inputs.images_spatial_crop,
                attention_mask=prepare_inputs.attention_mask,
                chunk_size=chunk_size
            )

        # run the model to get the response
        outputs = vl_gpt.generate(
            inputs_embeds=inputs_embeds,
            input_ids=prepare_inputs.input_ids,
            images=prepare_inputs.images,
            images_seq_mask=prepare_inputs.images_seq_mask,
            images_spatial_crop=prepare_inputs.images_spatial_crop,
            attention_mask=prepare_inputs.attention_mask,
            past_key_values=past_key_values,
            pad_token_id=tokenizer.eos_token_id,
            bos_token_id=tokenizer.bos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            max_new_tokens=512,
            do_sample=True,
            temperature=0.4,
            top_p=0.9,
            repetition_penalty=1.1,
            use_cache=True,
        )

        answer = tokenizer.decode(
            outputs[0][len(prepare_inputs.input_ids[0]):].cpu().tolist(),
            skip_special_tokens=False
        )

    print(f"{prepare_inputs['sft_format'][0]}", answer)

    vg_image = parse_ref_bbox(answer, image=pil_images[-1])
    if vg_image is not None:
        vg_image.save("./vg.jpg", format="JPEG", quality=85)


conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\n<|grounding|>Describe the image.",
        "images": [
            "cat.png",
        ],
    },
    {"role": "<|Assistant|>", "content": ""},
]

main(conversation)

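For reference, a couple of hedged variants of that final call; the checkpoint name below is only an example and not taken from this comment:

main(conversation, chunk_size=-1)  # enough GPU memory: single-pass prepare_inputs_embeds
main(conversation, model_path="deepseek-ai/deepseek-vl2-small", chunk_size=512)  # chunked prefilling, e.g. on a 40G GPU
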
Is this for me @p1x33l?

Testing prince-canuma--deepseek-vl2-tiny
[terminal output omitted]

Compared to deepseek-small and deepseek-vl2. Testing mlx-community/deepseek-vl2-6bit
[terminal output omitted]

When I input an image, the inference output is repetitive and not well organized.
conversation = [
    {
        "role": "<|User|>",
        "content": "\n Describe the image.",
        "images": ["../data/cat.png"],
    },
    {"role": "<|Assistant|>", "content": ""},
]

pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt="Describe the image."
).to(vl_gpt.device)

inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

outputs = vl_gpt.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=128,
    do_sample=False,
    use_cache=True
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
The output is as follows:
Describe the image.
<|User|>:
Describe the image.
<|Assistant|>: The image shows a cat wearing a shirt with a shirt with a design that resembles a shirt with a design that resembles a shirt with a design that resembles a shirt with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with with
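Judging from the rest of the thread, a plausible cause of the repetition is the same missing input_ids path that PR #11 addresses. Two further differences from the working demo are that the user turn here has no <image> placeholder (it may simply have been lost when the comment was pasted) and that system_prompt is non-empty. A hedged sketch of those adjustments, reusing the variable names above; the corrected generate call itself is the one sketched after the PR #11 comment:

# restore the <image> placeholder in the user turn, as in the working demo above
conversation[0]["content"] = "<image>\n Describe the image."

# the working examples in this thread use an empty system prompt
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt=""
).to(vl_gpt.device)

# when input_ids are passed to generate, the output also contains the prompt
# tokens, so slice them off before decoding (as in the extracted inference.py)
answer = tokenizer.decode(
    outputs[0][len(prepare_inputs.input_ids[0]):].cpu().tolist(),
    skip_special_tokens=True,
)
print(f"{prepare_inputs['sft_format'][0]}", answer)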