As far as I know, CLIP ViT (336px) produces 576 visual tokens per image, and LLaVA-UHD stacks 6 of these slices, which comes to 3000+ visual tokens.
How can that many tokens be fed into the LLM?
We use a resampler to down-sample the feature tokens, i.e., each slice is compressed from 576 to 144 tokens. As a result, the 6 slices (together with the overview image) need only 7 × 144 = 1008 tokens in total.
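For intuition, here is a minimal sketch of how a cross-attention resampler of this kind can compress a 576-token slice into a fixed set of 144 learnable query tokens. The `TokenResampler` module, its dimensions, and the hyperparameters are assumptions for illustration only, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class TokenResampler(nn.Module):
    """Hypothetical sketch: compress N visual tokens into a fixed number
    of learnable query tokens via cross-attention (not the repo's module)."""
    def __init__(self, dim=1024, num_queries=144, num_heads=8):
        super().__init__()
        # Learnable queries define the compressed token budget (144 here).
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens):
        # visual_tokens: (B, 576, dim), e.g. CLIP ViT features for one slice
        B = visual_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)        # (B, 144, dim)
        out, _ = self.attn(q, visual_tokens, visual_tokens)    # queries attend to slice tokens
        return self.norm(out)                                  # (B, 144, dim)

# Each slice: 576 -> 144 tokens, so 6 slices plus an overview fit in ~1008 tokens.
resampler = TokenResampler()
slice_tokens = torch.randn(6, 576, 1024)   # 6 slices of visual features
compressed = resampler(slice_tokens)       # (6, 144, 1024)
print(compressed.shape)
```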
In addition, our repository has been substantially improved and almost all known bugs have been fixed. For details, please refer to the main branch and the LLaVA-UHD v1 branch. This issue is now closed; if you run into any new problems, feel free to open a new issue.