
Support video in MiniCPM-V 2.6 #14

Open
saket424 opened this issue Aug 7, 2024 · 10 comments
Labels: enhancement (New feature or request)

Comments

saket424 commented Aug 7, 2024

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video

The claim is that it performs very well for an 8-billion-parameter model.

I am interested in learning what it takes to add support for 2.6, given that 2.5 is already supported.

Thanks

@saket424 (Author)

I tried MiniCPM-V-2_6 naively and got:

server-1 | INFO: 192.168.155.172:39070 - "POST /v1/chat/completions HTTP/1.1" 422 Unprocessable Entity

So I need @matatonic's assistance.
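
For context, the request I sent was shaped roughly like this (a sketch; the host, image path, and prompt are placeholders from my setup, with the server on port 5006 as in the logs below):

import base64, requests

# Encode a local test image as a data URL (path is a placeholder)
with open("test.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "openbmb/MiniCPM-V-2_6",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "data:image/jpeg;base64," + image_b64}},
        ],
    }],
}
r = requests.post("http://localhost:5006/v1/chat/completions", json=payload)
print(r.status_code, r.text)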

@matatonic (Owner)

Currently testing, but image only so far, no video.

@matatonic (Owner)

I've updated a dev branch with the latest changes, including MiniCPM-V 2.6, microsoft/Phi-3.5-vision-instruct and fancyfeast/joy-caption-pre-alpha. I'm still testing and the :dev image is still building, so YMMV.

@saket424 (Author)

By video, they mean a collection of images (so not quite video)

ggerganov/llama.cpp#9165
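
Presumably, then, video support through the OpenAI-compatible endpoint would just mean sending several sampled frames as images in a single message. A sketch of that shape (frame files and host are placeholders):

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5006/v1", api_key="skip")

def to_data_url(path):
    # Base64-encode one frame as a data URL
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("utf-8")

frames = ["frame_0001.jpg", "frame_0002.jpg", "frame_0003.jpg"]  # pre-sampled frames
content = [{"type": "image_url", "image_url": {"url": to_data_url(p)}} for p in frames]
content.append({"type": "text", "text": "Describe the video"})

response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-2_6",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)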

@saket424 (Author)

The dev build works, thanks.
CLI_COMMAND="python vision.py -m openbmb/MiniCPM-V-2_6 --use-flash-attn --device-map cuda:0 --load-in-4bit"
anand@dell4090:~/openedai-stuff/openedai-vision$ docker compose up
[+] Running 2/1
✔ Network openedai-vision_default Created 0.1s
✔ Container openedai-vision-server-1 Created 0.0s
Attaching to server-1
server-1 | 2024-08-25 21:35:27.061 | INFO | __main__:<module>:143 - Loading VisionQnA[minicpm-v-2_6] with openbmb/MiniCPM-V-2_6
Loading checkpoint shards: 100% 4/4 [00:07<00:00, 1.75s/it]
server-1 | 2024-08-25 21:35:36.056 | INFO | vision_qna:loaded_banner:94 - Loaded openbmb/MiniCPM-V-2_6 [ device: cuda:0, dtype: torch.bfloat16, template: internal ]
server-1 | INFO: Started server process [7]
server-1 | INFO: Waiting for application startup.
server-1 | INFO: Application startup complete.
server-1 | INFO: Uvicorn running on http://0.0.0.0:5006 (Press CTRL+C to quit)
preprocessor_config.json: 100% 714/714 [00:00<00:00, 8.25MB/s]
processing_minicpmv.py: 100% 10.0k/10.0k [00:00<00:00, 63.4MB/s]
image_processing_minicpmv.py: 100% 16.6k/16.6k [00:00<00:00, 104MB/s]
server-1 | A new version of the following files was downloaded from https://huggingface.co/openbmb/MiniCPM-V-2_6:
server-1 | - image_processing_minicpmv.py
server-1 | . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
server-1 | A new version of the following files was downloaded from https://huggingface.co/openbmb/MiniCPM-V-2_6:
server-1 | - processing_minicpmv.py
server-1 | - image_processing_minicpmv.py
server-1 | . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
server-1 | /usr/local/lib/python3.11/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
server-1 | warnings.warn(
server-1 | INFO: 192.168.155.172:39888 - "POST /v1/chat/completions HTTP/1.1" 200 OK
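
A side note on the remote-code warnings above: as the message suggests, the downloaded files can be pinned with the revision argument to from_pretrained (the commit SHA below is a placeholder):

from transformers import AutoModel

# Pin the model (and its trust_remote_code files) to a fixed commit
# so new versions are never pulled silently (SHA is a placeholder).
model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-2_6",
    trust_remote_code=True,
    revision="abc1234",
)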

@matatonic (Owner)

> By video, they mean a collection of images (so not quite video)
>
> ggerganov/llama.cpp#9165

Yes, it's an image-sampling technique, but it's still not working for me: the sample code they provide fails to identify the video in my tests. Perhaps it's still my error, but it probably won't be fixed for this release.

@saket424 (Author)

There's another project I like called amblegpt (https://github.com/mhaowork/amblegpt) that has an ffmpeg frame sampler built in and is OpenAI-compatible.

We can try to use it to test this functionality.
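
For reference, that style of ffmpeg sampling can be approximated with a plain ffmpeg call, something like this sketch (paths and the 1 fps rate are illustrative):

import subprocess

# Sample roughly one frame per second from a video with ffmpeg
# (input path, output pattern, and fps value are illustrative;
# the frames/ directory must already exist).
subprocess.run(
    [
        "ffmpeg",
        "-i", "input.mp4",
        "-vf", "fps=1",
        "frames/frame_%04d.jpg",
    ],
    check=True,
)

The resulting JPEGs could then be sent as a multi-image chat request like the sketch earlier in this thread.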

matatonic changed the title from "Request to support new version MiniCPM-V 2.6 model" to "Support video in MiniCPM-V 2.6" on Aug 26, 2024
@matatonic (Owner)

Merged to main, 0.29.0 release. I will leave this ticket open until video is supported.

@saket424 (Author)

I tried this standalone Python code and it runs on my 4090 GPU:

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu  # pip install decord

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

MAX_NUM_FRAMES = 64  # if CUDA OOM, set a smaller number

def encode_video(video_path):
    def uniform_sample(l, n):
        # Pick n evenly spaced indices (one from the middle of each of n equal bins)
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # sample ~1 frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_path = "video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},  # frames precede the text prompt
]

# Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2  # use 1 if CUDA OOM and video resolution > 448*448

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
Test video: fight.mp4

python3 try.py
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 16.71it/s]
num frames: 15
/home/anand/2.6/venv2.6/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
The video begins with a news broadcast from FOX 11 at 5 PM, showing a cityscape during sunset or sunrise. It then transitions to footage of two individuals in what appears to be a school hallway near blue lockers and yellow caution lines on the floor. One individual is wearing a dark shirt and light-colored pants, while the other is in a white top and dark pants. The scene involves physical confrontation where one person is restrained by the other against the wall. The struggle continues as the person in the white top attempts to maintain control over the situation. Eventually, another individual enters, seemingly trying to mediate or intervene. The final frame features the FOX 11 logo with text "ONLY ON FOX 11," indicating exclusive content coverage.

@saket424 (Author)

@matatonic
I managed to try this still-unfinished llama.cpp PR and it works:

ggerganov/llama.cpp#9165 (comment)

matatonic added the enhancement (New feature or request) label on Sep 11, 2024