-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
I couldn't find the code about Video Encoder in llama 3.2 vision #795
Comments
@blurmemo thanks for your interest. The paper describe overall vision for llama 3 family of models the llama 3.2 is image reasoning only. re 1: you need to send the image and prompt as suggested here. re 2: the llama 3.2 only works with images and one image at a time. Hope that helps to clarify a bit. |
@HamidShojanazeri For the question 2, I want input with multiple images-text pairs or video-text pairs when I fine tuning.As shown below. For input with multiple images-text pairs, Can I modify code to extract For input with video-text pairs, I want to know the implementation codes whether the official implementation codes or Above are some additional notes from me, I am doing some interesting work based on llama-3.2-vision and hope for your help. |
llama 3.2 vision is a good work!
I am doing some interesting work based on llama 3.2 vision. I have read paper about llama 3.2 vision, but I have a very important question to ask.
Below is a image of the model architecture for image-text input
question 1: Can I input only image and answer, no text?
question 2: For video input, after Image Encoder, the encoding results are sent to video branch. I couldn't find out codes about handling Video Image Encoder output branch(That's the red box in the image above) in the HuggingFace implementation(implementation is in the HuggingFace's transformers repository and llama 3.2 vision model path is "transformers/src/transformers/model/mllama"), Can you help telling me the code location?
I really look forward to getting your help eagerly, thank you!
The text was updated successfully, but these errors were encountered: