Support video understanding #9165
base: master
Conversation
Very interesting. Video understanding could be the next big thing. Thank you for the contribution!
Makefile (outdated diff):
llama-minicpmv-cli: examples/llava/minicpmv-cli.cpp \
	examples/llava/llava.cpp \
	examples/llava/llava.h \
	examples/llava/clip.cpp \
	examples/llava/clip.h \
	$(OBJ_ALL)
-	$(CXX) $(CXXFLAGS) $< $(filter-out %.h $<,$^) -o $@ $(LDFLAGS) -Wno-cast-qual
+	$(CXX) $(CXXFLAGS) $(FFMPEG_CFLAGS) $< $(filter-out %.h $<,$^) -o $@ $(LDFLAGS) $(FFMPEG_LIBS) -Wno-cast-qual
It would be nice to only enable video support behind a special flag, for example LLAMA_FFMPEG (the same way as LLAMA_CURL). Also, don't forget to add support for linking ffmpeg in the CMake build.
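For illustration, a minimal sketch of how such a gate might look on the C++ side, assuming the build defines LLAMA_FFMPEG when video support is compiled in (the flag wiring and the process_video_frames helper are assumptions, not code from this PR):

```cpp
#include <cstdio>
#include <string>

#ifdef LLAMA_FFMPEG
void process_video_frames(const std::string & path); // hypothetical ffmpeg-backed helper
#endif

// Sketch: video handling is compiled in only when the build defines
// LLAMA_FFMPEG, mirroring how LLAMA_CURL gates libcurl usage.
int handle_video_arg(const std::string & video_path) {
#ifdef LLAMA_FFMPEG
    process_video_frames(video_path);
    return 0;
#else
    (void) video_path;
    fprintf(stderr, "error: this binary was built without video support; rebuild with LLAMA_FFMPEG=1\n");
    return 1;
#endif
}
```

The CMake side would then need an equivalent option that adds the compile definition and links the ffmpeg libraries, analogous to the existing LLAMA_CURL handling.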
OK, I will try it.
I have taken a stab at implementing this compiler flag in an amending PR -- it may or may not be useful to you:
OpenBMB#32
@tc-mb If you like it, feel free to merge that one -- if you do, it should smoothly merge my changes into your PR here. If you don't want it, then no hard feelings -- I won't be offended. :) I'm simply a fan of your work, and generally wanted to make an attempt at helping this PR along.
@tc-mb for example
Ah, I see: --video takes an mp4 file as input and does the frame sampling internally.
@tc-mb A fifteen-second video clip seems to work fine and produces English output. It would be great if we could specify how many frames we would like libav to sample from the clip, e.g. 0.3 fps rather than the default of 1 fps.
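To make the request concrete, here is a hedged sketch (not the PR's actual code) of how a configurable sampling rate could look in a libav* decoding loop; error handling and decoder flushing are trimmed for brevity:

```cpp
// Sketch: decode a video with FFmpeg's libav* libraries and keep roughly
// `target_fps` frames per second, e.g. 0.3 instead of 1.0.
extern "C" {
#include <libavformat/avformat.h>
#include <libavcodec/avcodec.h>
}
#include <vector>

std::vector<AVFrame *> sample_frames(const char * path, double target_fps) {
    std::vector<AVFrame *> frames;
    AVFormatContext * fmt = nullptr;
    if (avformat_open_input(&fmt, path, nullptr, nullptr) < 0) return frames;
    avformat_find_stream_info(fmt, nullptr);

    const int vstream = av_find_best_stream(fmt, AVMEDIA_TYPE_VIDEO, -1, -1, nullptr, 0);
    if (vstream < 0) { avformat_close_input(&fmt); return frames; }

    AVStream * st = fmt->streams[vstream];
    const AVCodec * dec = avcodec_find_decoder(st->codecpar->codec_id);
    AVCodecContext * ctx = avcodec_alloc_context3(dec);
    avcodec_parameters_to_context(ctx, st->codecpar);
    avcodec_open2(ctx, dec, nullptr);

    AVPacket * pkt = av_packet_alloc();
    AVFrame  * frm = av_frame_alloc();
    double next_t = 0.0; // timestamp (seconds) of the next frame to keep

    while (av_read_frame(fmt, pkt) >= 0) {
        if (pkt->stream_index == vstream && avcodec_send_packet(ctx, pkt) == 0) {
            while (avcodec_receive_frame(ctx, frm) == 0) {
                if (frm->pts == AV_NOPTS_VALUE) continue;
                const double t = frm->pts * av_q2d(st->time_base);
                if (t >= next_t) {              // keep ~target_fps frames/sec
                    frames.push_back(av_frame_clone(frm));
                    next_t += 1.0 / target_fps;
                }
            }
        }
        av_packet_unref(pkt);
    }

    av_frame_free(&frm);
    av_packet_free(&pkt);
    avcodec_free_context(&ctx);
    avformat_close_input(&fmt);
    return frames;
}
```

Exposing target_fps as a command-line option would then be a small change to the argument parsing.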
@saket424 I saw the log you sent and noticed something: the input prompt comes after "用户" ("user") and is in Chinese, and the model's reply comes after "AI". Since the question was in Chinese, a Chinese answer seems reasonable.
Strange. I have reproduced the problem. Thank you very much for helping me find it; I hadn't noticed it before.
@tc-mb, is it possible to have the interactive option -i in order to ask follow-up questions, like you can with image input? Right now it just describes the video and quits even if you specify -i.
@tc-mb
yuri.jpg (Baseline), yuvj420p(pc, bt470bg/unknown/unknown), 4080x3072
ffmpeg -i yuri.jpg -vf scale=1280:-1 yuri-small.jpg
yuri-small.jpg (Baseline), yuvj420p(pc, bt470bg/unknown/unknown), 1280x964
No problem, I will update the code this week so that you can use the video understanding features with -i mode.
I'm sorry, I can't reproduce the bug you posted. I tested large images here and they all work: I tried both square and rectangular images, even much larger than the size you mentioned, and they were all usable. You can send me the image, or check whether the cause is insufficient memory, because I suspect that is the most likely problem.

The idea of llama.cpp is that edge devices can also run large models, so the program repeatedly allocates small buffers during inference rather than allocating one very large buffer at initialization. This is convenient for optimizing performance on edge devices, but it sometimes means the program runs out of memory partway through execution. My code is inherited from the original llava implementation, whose source does not check the result of each malloc, so a failed allocation leaves an invalid pointer that is only detected when it is used downstream, possibly many functions away from the real problem. I will add checks to the malloc calls in the multimodal code later this week, so that failed memory allocations can be detected in time.
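A minimal sketch of the kind of check meant here (the function and buffer are illustrative, not the actual clip.cpp code):

```cpp
#include <cstdio>
#include <cstdlib>

// Sketch: validate every allocation where it happens, so an out-of-memory
// condition is reported immediately instead of crashing many functions
// later when the invalid pointer is finally dereferenced.
static float * alloc_image_buffer(size_t n_floats) {
    float * buf = (float *) malloc(n_floats * sizeof(float));
    if (buf == nullptr) {
        fprintf(stderr, "%s: failed to allocate %zu bytes\n", __func__, n_floats * sizeof(float));
        return nullptr; // caller must bail out instead of using a bad pointer
    }
    return buf;
}
```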
I have documented the crash here: #9230
@tc-mb is there a separate PR for the -i interactive follow-up questions, or are you planning to push more commits to this one?
I have tested this out, and I was able to successfully get it to answer questions about a video file -- very exciting! The portion where the video frames are encoded takes a very long time; adding a message that says something like "encoding video frame 7 of 16" may be a nice thing to add. I'm also wondering about other ways to speed up video processing, such as adding a

Overall, great work, and I'll be very excited for us to get video support added to llama.cpp!
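For what it's worth, a minimal sketch of the progress message I have in mind, assuming some loop over the sampled frames (the encode_frame call is a hypothetical stand-in for the real per-frame embedding code):

```cpp
#include <cstdio>
#include <cstddef>

// Sketch: report per-frame progress so the long encoding phase looks like
// work in progress rather than a hang.
void encode_all_frames(size_t n_frames) {
    for (size_t i = 0; i < n_frames; ++i) {
        fprintf(stderr, "encoding video frame %zu of %zu\n", i + 1, n_frames);
        // encode_frame(i); // hypothetical per-frame embedding call
    }
}
```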
…nabled optionally by setting LLAMA_FFMPEG=1 in call to make.
…PEG=1 to help users know exactly how to recompile with video support. Suggestion by @Galunid.
I'm sorry, I was busy with another project and responded a little late.
ffmpeg compiler flag for video understanding
OK, I will adapt it to the current main branch and change it this week.
I've been a little busy in the past two weeks, and I will revise it as soon as possible.
Dear llama.cpp official,
Hi, as I promised before, now that MiniCPM-V 2.6 has been merged I am submitting a PR to support video understanding. Because llama.cpp does not currently support video file processing, I expect this PR may stay open for a long time while we fully discuss how to integrate video capabilities into the code, but I am ready to actively support its review.
For MiniCPM-V 2.6, we took the approach of extracting frames from the video file and feeding each frame to the model sequentially. At the code level, I introduced the open-source library ffmpeg to implement video frame extraction, and added a "video" parameter to the args of llama.cpp to read video files.
Before use, install FFmpeg in your environment.

Run the quantized int4 version:
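(The exact command was not captured here; a representative invocation might look like the following, where the model paths are placeholders and only the --video argument is new in this PR:)

```sh
./llama-minicpmv-cli \
    -m ./MiniCPM-V-2_6/ggml-model-Q4_K_M.gguf \
    --mmproj ./MiniCPM-V-2_6/mmproj-model-f16.gguf \
    --video ./demo.mp4 \
    -p "Describe this video."
```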
That is the only difference when using video. I look forward to your testing and discussion.
Best regards,
MiniCPM-V official ^_^