Multimodal Cross-attention incorrect results #2796

Open · 2 of 4 tasks
mutkach opened this issue Feb 19, 2025 · 7 comments
Labels: bug (Something isn't working)

Comments

mutkach commented Feb 19, 2025

System Info

cpu: x86_64
mem: 128G
gpu: H100 80G
docker: tritonserver:24.12-trtllm-python-py3
Cuda: 12.6
Driver: 535.216.01
TensorRT: 10.7.0
TensorRT-LLM: v0.17.0

Who can help?

@kaiyux @byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce:

  • Use the official scripts for checkpoint conversion and engine build in examples/multimodal for Llama-3.2-11B-Vision-Instruct, with plain conversion and no quantization.
  • Add debugging outputs after each attention block (both self-attention and cross-attention) as shown in the debugging guide. Run the multimodal runner with a different image input (not the rabbit one). Save torch tensors of each layer's output during the context phase. I can provide concrete prompts and examples.
  • Compare the resulting tensors with the torch implementation. For each layer, inspect the outputs visually or compute the correlation coefficient between the attention block output and the corresponding torch output (see the comparison sketch after this list). Observe that the discrepancy starts at the cross-attention block (the 4th layer): the correlation coefficient is ~0.99 for the first 3 layers (precisely until the first cross-attention) and drops to 0.5-0.8 for the following layers' outputs.
  • During the decoding phase, observe that the accumulated difference leads to different (and incorrect) results on visual tasks.
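
For reference, a minimal sketch of the per-layer comparison described above. The file names, the one-tensor-per-layer dump layout, and the layer count are illustrative assumptions, not something the official scripts produce:

```python
# Hypothetical layout: one dumped tensor per decoder layer from the TRT-LLM
# context phase and one from the HF/torch forward pass on the same prompt/image.
import torch

NUM_LAYERS = 40  # assumption: Llama-3.2-11B-Vision text decoder layer count

for layer in range(NUM_LAYERS):
    trt = torch.load(f"trtllm_layer_{layer}.pt").float().flatten()
    ref = torch.load(f"torch_layer_{layer}.pt").float().flatten()
    # Pearson correlation between the flattened activations of the two runs
    corr = torch.corrcoef(torch.stack([trt, ref]))[0, 1].item()
    max_abs = (trt - ref).abs().max().item()
    print(f"layer {layer:2d}: corrcoef={corr:.4f}, max_abs_diff={max_abs:.4f}")
```

With dumps like these, the drop from ~0.99 to 0.5-0.8 shows up exactly at the first cross-attention layer.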

Expected behavior

The cross-attention block output should be closer to the torch implementation; otherwise accuracy is markedly lower on OCR-related tasks. I can provide specific examples if needed. I understand that exactly equal output is not to be expected from the optimized TRT-LLM engine, but the observed difference is an order of magnitude larger than expected.

actual behavior

During the context phase there is a growing discrepancy between the torch outputs and the engine outputs, which leads to incorrect results during the decoding phase. Note that only vision capabilities are affected (i.e., when the cross-attention blocks are not skipped).

additional notes

Regarding the matter: I tried turning off some or most of the trtllm-build flags, which led to even more incorrect results.

Visual transformer engine outputs seem to differ slightly too, so I injected the torch output tensor into the TRT-LLM pipeline directly in order to rule out the visual encoder for now.
The problem seems to be inside the cross-attention block, and the discrepancies show up only when multimodality is involved.
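
For reference, a rough sketch of how such a reference tensor can be obtained from the HF model. This assumes the transformers Mllama implementation exposes vision_model and multi_modal_projector (attribute names may differ across transformers versions), and it does not show the injection point into the TRT-LLM runner:

```python
# Sketch: dump reference cross-attention states from the HF model so they can
# stand in for the TRT-LLM visual encoder output during debugging.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")

image = Image.open(
    requests.get("https://lanytek.com/images/demo_lower.png", stream=True).raw
)
inputs = processor(
    images=image,
    text="<|image|><|begin_of_text|>Describe the image.",
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
    # Vision tower output, then the projection into the language model's hidden
    # size; the HF forward pass additionally reshapes this before handing it to
    # the cross-attention layers.
    vision_out = model.vision_model(
        pixel_values=inputs["pixel_values"].to(model.dtype),
        aspect_ratio_ids=inputs["aspect_ratio_ids"],
        aspect_ratio_mask=inputs["aspect_ratio_mask"],
    )
    cross_attention_states = model.multi_modal_projector(vision_out[0])

torch.save(cross_attention_states.cpu(), "reference_cross_attention_states.pt")
```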

Is there a way to fall back to an unoptimized cross-attention implementation that I could use for now, until the underlying problem is solved? Turning off the GPT attention plugin does not seem to be supported right now.
There is a chance that the problem is on my side somewhere (I previously had to change some code while trying to solve a different issue, triton-inference-server/tensorrtllm_backend#692), though I think that probability is minimal now that I have double- and triple-checked everything.

Also, I would appreciate it if you could share some advanced debugging techniques for ruling out similar issues in the future.

P.S. thank you for your work, it is much appreciated!

mutkach added the bug label on Feb 19, 2025

mutkach (author) commented Feb 19, 2025

AI2D eval reports show 60% accuracy, which is lower than the 65% threshold given in the multimodal guide.

Meta's self-reported results show it should be ~91%.

JC1DA commented Feb 19, 2025

Hi, I ran into the same issue for Llama-3.2-11B-Vision-Instruct.

python3 examples/multimodal/run.py --engine_dir /models/Llama-3.2-11B-Vision-Instruct_trt_engine_v3 --visual_engine_name visual_encoder.engine --hf_model_dir /models/Llama-3.2-11B-Vision-Instruct --image_path https://lanytek.com/images/demo_lower.png --input_text '<|image|><|begin_of_text|>Describe the image.' --max_new_tokens 100

The engine works for some images but fails for others, while the transformers model works fine with those images.
Can anyone help take a look at this issue? Thanks.

mayani-nv commented Mar 4, 2025

@JC1DA @mutkach I tried running the non-instruct version, which is in the README of the guide, and that seems to be working fine:

python3 examples/multimodal/run.py --visual_engine_dir /tmp/mllama/trt_engines/encoder/  \
--visual_engine_name visual_encoder.engine --llm_engine_dir /tmp/mllama/trt_engines/decoder/   \
 --hf_model_dir meta-llama/Llama-3.2-11B-Vision --image_path https://lanytek.com/images/demo_lower.png \
 --input_text "<|image|><|begin_of_text|>If I had to write a haiku for this one"     \
 --max_new_tokens 50 --batch_size 2

In the above I used the image from the previous comment. So is this specific to the instruct version?

JC1DA commented Mar 4, 2025

Hi @mayani-nv, I used the instruct version. The FP16 version does not generate anything, while the BF16 version generates an incorrect description.

I can test the non-instruct version; can you also test the instruct version to confirm?

mutkach (author) commented Mar 4, 2025

@mayani-nv

For a bit more involved example (image attached), with input "Question: What is the total income? Answer: ":

Vision gives:
[A]: ["2330. Question: 1. The answer: 233. (I'm sorry). Note"]

Vision-Instruct gives:
[A]: ['To find the total income, we need to add all the revenue values. The revenue values are represented by the number in the "net" (divine arrangement). The design (divine arrangement) has been "crossing" ...

HF Vision gives correctly:
Answer: 2,173. <OCR/> 1 2 3 4 5 6 Date

Another example (image attached), with input "Question: What is the gross total? Answer: ":

Vision:
[A]: ["6,000,000. Question: What is the net revenue? Answer: 5,500,000. I'm not sure who these individuals are, but they bought their items for 50% off. How did the cashier make up the difference ...

HF Vision:
Answer: 60009585.7. <OCR/> 1 2 3 A NAME Widget Thingo Computer Yacht 4 5 6 7 8 9

mayani-nv commented

Replying to @JC1DA's comment above:
I got the following generated with the non-instruct version.

[Q] <|image|><|begin_of_text|>If I had to write a haiku for this one
[03/05/2025-04:08:12] [TRT-LLM] [I]
[A]: [", it would be:.\\nA dog and his human.\\nA bond that's unbreakable.\\nA paw-some friendship.\\nWhat's your haiku for your dog? Share it in the comments below!\\n: Unsplash"]
[03/05/2025-04:08:12] [TRT-LLM] [I]

mayani-nv commented

Replying to @mutkach's examples above:

Thanks for sharing this. I can reproduce the same behavior with TRT-LLM; the outputs on my end also look like gibberish when using the non-instruct model, as follows:

[Q] <|image|><|begin_of_text|>Question: What is the total income? Answer:
[03/05/2025-04:09:47] [TRT-LLM] [I]
[A]: ['2,000. Question: 2. Question: 2. Question: 2. Question: 2. Question: 2. Question: 2. Question: 2. Question: 2. Question: 2. Question']

We will investigate it. I also ran the eval script, and it produces the following output:

python3 examples/multimodal/eval.py --model_type mllama --visual_engine_dir /tmp/mllama/trt_engines/encoder/ --visual_engine_name visual_encoder.engine --llm_engine_dir /tmp/mllama/trt_engines/decoder/ --hf_model_dir meta-llama/Llama-3.2-11B-Vision --test_trtllm --accuracy_threshold 65 --eval_task lmms-lab/ai2d
-----
-----
[03/05/2025-04:12:07] [TRT-LLM] [I] total iterations: 20
[03/05/2025-04:12:07] [TRT-LLM] [I] TRT-LLM's accuracy: 75.00%
[03/05/2025-04:12:07] [TRT-LLM] [I] Evaluation takes: 22.02736759185791 sec
[TensorRT-LLM][INFO] Refreshed the MPI local session
