Always have same response #21
Comments
How did you get the tokenizer? Regarding your problem, I think it may be because you are using model.llm, which is just the LLaMA part; in that case the Whisper and CLIP parts are not used. From what I understand, we should run the model through its full forward pass rather than through model.llm alone.
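To make the point above concrete, here is a minimal toy sketch (pure Python, hypothetical class and method names, not the project's actual API) of why calling the LLM submodule directly drops the image: only the full model's generate path routes the image through the vision encoder before the LLM ever sees a token.

```python
class ToyLLM:
    """Stand-in for the LLaMA part: it only sees the tokens it is given."""
    def generate(self, tokens):
        return f"response conditioned on {len(tokens)} tokens"

class ToyMultimodalModel:
    """Stand-in for the full model: vision encoder + projection + LLM."""
    def __init__(self):
        self.llm = ToyLLM()

    def encode_image(self, image):
        # Placeholder for the CLIP/Whisper encoders plus projection layer.
        return [f"<img:{p}>" for p in image]

    def generate(self, text_tokens, image=None):
        tokens = list(text_tokens)
        if image is not None:
            # Prepend projected image tokens so the LLM can attend to them.
            tokens = self.encode_image(image) + tokens
        return self.llm.generate(tokens)

model = ToyMultimodalModel()
prompt = ["What", "is", "in", "the", "picture", "?"]
image = [1, 2, 3]

print(model.generate(prompt, image=image))  # image tokens reach the LLM (9 tokens)
print(model.llm.generate(prompt))           # image silently dropped (6 tokens)
```

Calling `model.llm.generate` bypasses `encode_image` entirely, so the text-only behavior described in this thread is exactly what that toy model would produce.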
Hi, thanks for sharing the information. We are currently checking it.
Hi @chatsci, see Lines 466 to 489 in d03e59d and Lines 952 to 963 in d03e59d.
I call the functions inside those snippets, and I'm pretty sure that the input tokens for the LLM contain image tokens. While conducting tests, I noticed that the model appears to disregard the image input and generates responses based only on the text portion.
Hi, thanks for sharing this information with us. I think the possible reason could be an incompatibility issue within the code. As I'm currently traveling, I will look into it as soon as my travel is finished. Would you mind sending the code you used to my email, [email protected], so I can take a look?
Hey @lyuchenyang, I have been experiencing the same issue during inference. Are there any updates on this? Thank you.
Hi, I have loaded your pre-trained weights and tried some instructions. However, I found that the model responds with the same answer no matter what image I give it. For the same prompt, the model always replies:

There are 5000 in the picture.

It seems the model simply ignores any multi-modal inputs and replies based on the text alone. Did I do anything wrong? Thank you.
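A quick way to confirm this symptom is to hold the prompt fixed, vary only the image, and check whether the output changes. The sketch below is a hypothetical helper (the stand-in `generate` lambdas are not the project's API; the real call would be the model's own generate function, with deterministic decoding assumed):

```python
def image_is_ignored(generate_fn, prompt, image_a, image_b, n_trials=3):
    """Return True if outputs are identical across two different images,
    which suggests the vision path is not influencing generation."""
    outs_a = [generate_fn(prompt, image_a) for _ in range(n_trials)]
    outs_b = [generate_fn(prompt, image_b) for _ in range(n_trials)]
    # All trials and both images collapse to one single output string.
    return set(outs_a) == set(outs_b) == {outs_a[0]}

# Demo with stand-in generators illustrating the two cases:
broken = lambda prompt, image: "There are 5000 in the picture."
working = lambda prompt, image: f"I see {sum(image)} objects."

print(image_is_ignored(broken, "What is in the picture?", [1, 2], [9, 9]))
print(image_is_ignored(working, "What is in the picture?", [1, 2], [9, 9]))
```

If the check returns True on the real model, the image embeddings are either never injected into the LLM input or are being overwritten somewhere along the pipeline, which matches the behavior reported in this thread.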