I couldn't find the code about Video Encoder in llama 3.2 vision #795

Open
blurmemo opened this issue Nov 20, 2024 · 2 comments

@blurmemo

Llama 3.2 vision is great work!
I am doing some interesting work based on Llama 3.2 vision. I have read the paper about Llama 3.2 vision, but I have some very important questions to ask.

Below is an image of the model architecture for image-text input.
[model architecture figure from the Llama 3 paper]

Question 1: Can I input only an image and an answer, with no text?
Question 2: For video input, after the Image Encoder, the encoding results are sent to the video branch. I couldn't find the code that handles the Video Image Encoder output branch (the red box in the image above) in the HuggingFace implementation (the implementation lives in HuggingFace's transformers repository, and the Llama 3.2 vision model path is "transformers/src/transformers/models/mllama"). Can you help me find the code location?

I am really looking forward to your help. Thank you!

@HamidShojanazeri
Contributor

@blurmemo thanks for your interest. The paper describes the overall vision for the Llama 3 family of models; Llama 3.2 is image reasoning only.

re 1: you need to send the image and the prompt, as suggested here.

re 2: Llama 3.2 only works with images, and with one image at a time.

Hope that helps to clarify a bit.
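
For reference, here is a minimal sketch of that image-plus-prompt call using the HuggingFace transformers mllama classes; the model id, image URL, and prompt text below are only illustrative assumptions, not the exact snippet linked above.

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Assumed model id; swap in whichever Llama 3.2 vision checkpoint you use.
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One image plus one text prompt -- the pairing described above.
image = Image.open(requests.get("https://example.com/scene.jpg", stream=True).raw)  # placeholder URL
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0]))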

@blurmemo
Author

@HamidShojanazeri
Thank you for your help! I think I may not have expressed my two questions very clearly.
For question 1: when I fine-tune based on Llama-3.2-vision, I want to construct my dataset from image-text pairs. The image is a natural scene and the text is only a description/the content/other information about the image, so I construct it as follows (raw data, no processing):
[{
    "images": image,
    "texts": [
        { "assistant": "this is image description_1" },
        { "assistant": "this is image content" },
        { "assistant": "this is other" }
    ]
}, ...]
I do not set a key-value pair for every text in the values associated with the key "texts", such as "user": "this is question or other" and "system": "criterion or other". So I want to know whether a "system": "criterion or other" pair is supported, and whether "user": "this is question or other" must be added to every text when I fine-tune, as shown below:
[{
    "images": image,
    "texts": [
        { "system": "criterion or other" },
        { "user": "" or "this is question or other", "assistant": "this is image description_1" },
        { "user": "" or "this is question or other", "assistant": "this is image content" },
        { "user": "" or "this is question or other", "assistant": "this is other" }
    ]
}, ...]
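
To make the intent concrete, here is a rough sketch of how I imagine one raw sample above could be mapped to the chat-message format that the HuggingFace mllama processor's apply_chat_template expects; the helper name to_messages, the fallback user question, and the assumption that the chat template accepts a system message are all my own guesses.

def to_messages(sample):
    """Map one raw {"images": ..., "texts": [...]} sample to chat messages.

    Assumes an optional {"system": ...} entry and otherwise entries with an
    "assistant" answer and an optional "user" question, as in the examples above.
    """
    messages = []
    image_used = False
    for turn in sample["texts"]:
        if "system" in turn:
            messages.append({"role": "system",
                             "content": [{"type": "text", "text": turn["system"]}]})
            continue
        user_text = turn.get("user") or "Describe the image."  # assumed fallback question
        content = ([] if image_used else [{"type": "image"}]) + [{"type": "text", "text": user_text}]
        image_used = True
        messages.append({"role": "user", "content": content})
        messages.append({"role": "assistant",
                         "content": [{"type": "text", "text": turn["assistant"]}]})
    return messages

The resulting list would then go through processor.apply_chat_template together with the sample's image; whether that role mapping is what the fine-tuning recipe expects is exactly what I am asking.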

For question 2: I want to input multiple-images-text pairs or video-text pairs when I fine-tune, as shown below.
[{
    "images": [image_1, image_2, ..., image_n],
    "texts": [
        ...
    ]
}, ...] (multiple-images-text pairs)

or

[{
    "video": [video frames],
    "texts": [
        ...
    ]
}, ...] (video-text pairs)

For input with multiple-images-text pairs: can I modify the code to extract image patches in the IMAGE ENCODER, add an "if images" branch (different from the existing "if image" branch) to handle the IMAGE ENCODER output, send the processed output to the cross-attention (in the LANGUAGE MODEL), and then fine-tune on my dataset, so that multiple-images-text input is realized?
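
If modifying the encoder is not feasible, the fallback I have in mind, given the "one image at a time" note above, is to flatten each multiple-images sample into single-image samples before fine-tuning. This is only a sketch; the helper name flatten_samples and the assumption that images[i] pairs with texts[i] are mine.

def flatten_samples(raw_samples):
    """Split each multiple-images sample into single-image samples.

    Assumes images[i] belongs with texts[i]; adjust the pairing rule to
    however the dataset actually associates images with texts.
    """
    flat = []
    for sample in raw_samples:
        for image, text in zip(sample["images"], sample["texts"]):
            flat.append({"images": image, "texts": [text]})
    return flat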

For input with video-text pairs, I want to know whether implementation code exists, either the official implementation or the HuggingFace implementation, for the red box in the image (from the Meta paper) below. If the implementation code is not provided, I accept that and would just like your confirmation.
[model architecture figure, with the video branch marked by a red box]

Those are some additional notes from me. I am doing some interesting work based on Llama-3.2-vision and hope for your help.
Thank you!
