Multimodal Cross-attention incorrect results #2796

Open · 2 of 4 tasks
mutkach opened this issue Feb 19, 2025 · 7 comments
Labels: bug (Something isn't working)

Comments

mutkach commented Feb 19, 2025

System Info

cpu: x86_64
mem: 128G
gpu: H100 80G
docker: tritonserver:24.12-trtllm-python-py3
Cuda: 12.6
Driver: 535.216.01
TensorRT: 10.7.0
TensorRT-LLM: v0.17.0

Who can help?

@kaiyux @byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce:

  • Use the official scripts for checkpoint conversion and engine build in examples/multimodal for Llama-3.2-11B-Vision-Instruct, with plain conversion and no quantization.
  • Add debugging outputs after each attention block (both self-attention and cross-attention) as shown in the debugging guide. Run the multimodal runner with a different image input (not the rabbit one). Save torch tensors of each layer's output during the context phase. I can provide concrete prompts and examples.
  • Compare the resulting tensors with the torch implementation. For each layer, inspect the outputs visually or compute the correlation coefficient between the attention block output and the corresponding torch output (see the comparison sketch after this list). Observe that the discrepancy starts at the cross-attention block (the 4th layer): the correlation coefficient is ~0.99 for the first 3 layers (precisely until the first cross-attention) and drops to 0.5-0.8 for the following layers' outputs.
  • During the decoding phase, observe that the accumulated difference leads to different (and incorrect) results on visual tasks.
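
For reference, a minimal sketch of the per-layer comparison described above. The file names, the one-tensor-per-layer dump layout, and the layer count are illustrative assumptions, not something the official scripts produce:

```python
# Hypothetical layout: one dumped tensor per decoder layer from the TRT-LLM
# context phase and one from the HF/torch forward pass on the same prompt/image.
import torch

NUM_LAYERS = 40  # assumption: Llama-3.2-11B-Vision text decoder layer count

for layer in range(NUM_LAYERS):
    trt = torch.load(f"trtllm_layer_{layer}.pt").float().flatten()
    ref = torch.load(f"torch_layer_{layer}.pt").float().flatten()
    # Pearson correlation between the flattened activations of the two runs
    corr = torch.corrcoef(torch.stack([trt, ref]))[0, 1].item()
    max_abs = (trt - ref).abs().max().item()
    print(f"layer {layer:2d}: corrcoef={corr:.4f}, max_abs_diff={max_abs:.4f}")
```

With dumps like these, the drop from ~0.99 to 0.5-0.8 shows up exactly at the first cross-attention layer.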

Expected behavior

The cross-attention block output should be closer to the torch implementation; otherwise accuracy is markedly lower on OCR-related tasks. I can provide specific examples if needed. I understand that exactly equal output is not to be expected from the optimized TRT-LLM engine, but the observed difference is an order of magnitude larger than expected.

actual behavior

During the context phase there is a growing discrepancy between the torch outputs and the engine outputs, which leads to incorrect results during the decoding phase. Note that only vision capabilities are affected (i.e., when the cross-attention blocks are not skipped).

additional notes

Regarding the matter: I tried turning off some or most of the trtllm-build flags, which led to even more incorrect results.

Visual transformer engine outputs seem to differ slightly too, so I injected the torch output tensor into the TRT-LLM pipeline directly in order to rule out the visual encoder for now.
The problem seems to be inside the cross-attention block, and the discrepancies show up only when multimodality is involved.
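
For reference, a rough sketch of how such a reference tensor can be obtained from the HF model. This assumes the transformers Mllama implementation exposes vision_model and multi_modal_projector (attribute names may differ across transformers versions), and it does not show the injection point into the TRT-LLM runner:

```python
# Sketch: dump reference cross-attention states from the HF model so they can
# stand in for the TRT-LLM visual encoder output during debugging.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")

image = Image.open(
    requests.get("https://lanytek.com/images/demo_lower.png", stream=True).raw
)
inputs = processor(
    images=image,
    text="<|image|><|begin_of_text|>Describe the image.",
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
    # Vision tower output, then the projection into the language model's hidden
    # size; the HF forward pass additionally reshapes this before handing it to
    # the cross-attention layers.
    vision_out = model.vision_model(
        pixel_values=inputs["pixel_values"].to(model.dtype),
        aspect_ratio_ids=inputs["aspect_ratio_ids"],
        aspect_ratio_mask=inputs["aspect_ratio_mask"],
    )
    cross_attention_states = model.multi_modal_projector(vision_out[0])

torch.save(cross_attention_states.cpu(), "reference_cross_attention_states.pt")
```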

Is there a way to fall back to an unoptimized cross-attention implementation that I could use for now, until the underlying problem is solved? Turning off the GPT attention plugin does not seem to be supported right now.
There is a chance that the problem is on my side somewhere (I previously had to change some code while trying to solve a different issue, triton-inference-server/tensorrtllm_backend#692), though I think that probability is minimal now that I have double- and triple-checked everything.

Also, I would appreciate it if you could share some advanced debugging techniques for ruling out similar issues in the future.

P.S. thank you for your work, it is much appreciated!

mutkach added the bug label on Feb 19, 2025

mutkach (author) commented Feb 19, 2025

AI2D eval reports show 60% accuracy, which is lower than the 65% threshold given in the multimodal guide.

Meta's self-reported results show it should be ~91%.

JC1DA commented Feb 19, 2025

Hi, I ran into the same issue for Llama-3.2-11B-Vision-Instruct.

python3 examples/multimodal/run.py --engine_dir /models/Llama-3.2-11B-Vision-Instruct_trt_engine_v3 --visual_engine_name visual_encoder.engine --hf_model_dir /models/Llama-3.2-11B-Vision-Instruct --image_path https://lanytek.com/images/demo_lower.png --input_text '<|image|><|begin_of_text|>Describe the image.' --max_new_tokens 100

The engine works for some images but fails for others, while the transformers model works fine with those images.
Can anyone help take a look at this issue? Thanks.

mayani-nv commented Mar 4, 2025

@JC1DA @mutkach I tried running the non-instruct version, which is in the README of the guide, and that seems to be working fine:

python3 examples/multimodal/run.py --visual_engine_dir /tmp/mllama/trt_engines/encoder/  \
--visual_engine_name visual_encoder.engine --llm_engine_dir /tmp/mllama/trt_engines/decoder/   \
 --hf_model_dir meta-llama/Llama-3.2-11B-Vision --image_path https://lanytek.com/images/demo_lower.png \
 --input_text "<|image|><|begin_of_text|>If I had to write a haiku for this one"     \
 --max_new_tokens 50 --batch_size 2

In the above I used the image from the previous comment. So is this specific to the instruct version?

JC1DA commented Mar 4, 2025

Hi @mayani-nv, I used the instruct version. The FP16 version does not generate anything, while the BF16 version generates an incorrect description.

I can test the non-instruct version; can you also test the instruct version to confirm?

mutkach (author) commented Mar 4, 2025

@mayani-nv

For a bit more involved example (image attached), with input "Question: What is the total income? Answer: ":

Vision gives:
[A]: ["2330. Question: 1. The answer: 233. (I'm sorry). Note"]

Vision-Instruct gives:
[A]: ['To find the total income, we need to add all the revenue values. The revenue values are represented by the number in the "net" (divine arrangement). The design (divine arrangement) has been "crossing" ...

HF Vision gives correctly:
Answer: 2,173. <OCR/> 1 2 3 4 5 6 Date

Another example (image attached), with input "Question: What is the gross total? Answer: ":

Vision:
[A]: ["6,000,000. Question: What is the net revenue? Answer: 5,500,000. I'm not sure who these individuals are, but they bought their items for 50% off. How did the cashier make up the difference ...

HF Vision:
Answer: 60009585.7. <OCR/> 1 2 3 A NAME Widget Thingo Computer Yacht 4 5 6 7 8 9

mayani-nv commented

Replying to @JC1DA's comment above:
I got the following generated with the non-instruct version.

[Q] <|image|><|begin_of_text|>If I had to write a haiku for this one
[03/05/2025-04:08:12] [TRT-LLM] [I]
[A]: [", it would be:.\\nA dog and his human.\\nA bond that's unbreakable.\\nA paw-some friendship.\\nWhat's your haiku for your dog? Share it in the comments below!\\n: Unsplash"]
[03/05/2025-04:08:12] [TRT-LLM] [I]

mayani-nv commented

Replying to @mutkach's examples above:

Thanks for sharing this. I can reproduce the same behavior with TRT-LLM; the outputs on my end also look like gibberish when using the non-instruct model, as follows:

[Q] <|image|><|begin_of_text|>Question: What is the total income? Answer:
[03/05/2025-04:09:47] [TRT-LLM] [I]
[A]: ['2,000. Question: 2. Question: 2. Question: 2. Question: 2. Question: 2. Question: 2. Question: 2. Question: 2. Question: 2. Question']

We will investigate it. I also ran the eval script, and it produces the following output:

python3 examples/multimodal/eval.py --model_type mllama --visual_engine_dir /tmp/mllama/trt_engines/encoder/ --visual_engine_name visual_encoder.engine --llm_engine_dir /tmp/mllama/trt_engines/decoder/ --hf_model_dir meta-llama/Llama-3.2-11B-Vision --test_trtllm --accuracy_threshold 65 --eval_task lmms-lab/ai2d
-----
-----
[03/05/2025-04:12:07] [TRT-LLM] [I] total iterations: 20
[03/05/2025-04:12:07] [TRT-LLM] [I] TRT-LLM's accuracy: 75.00%
[03/05/2025-04:12:07] [TRT-LLM] [I] Evaluation takes: 22.02736759185791 sec
[TensorRT-LLM][INFO] Refreshed the MPI local session
