
Support video in MiniCPM-V 2.6 #14

Open
saket424 opened this issue Aug 7, 2024 · 10 comments
Labels: enhancement (New feature or request)

Comments

saket424 commented Aug 7, 2024

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video

The claim is that it performs very well for an 8-billion-parameter model.

I am interested in learning what it takes to add support for 2.6, given that 2.5 is already supported.

Thanks

@saket424 (Author)

I tried MiniCPM-V-2_6 naively and got:

server-1 | INFO: 192.168.155.172:39070 - "POST /v1/chat/completions HTTP/1.1" 422 Unprocessable Entity

So I need @matatonic's assistance.
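
For context, the request I sent was shaped roughly like this (a sketch; the host, image path, and prompt are placeholders from my setup, with the server on port 5006 as in the logs below):

import base64, requests

# Encode a local test image as a data URL (path is a placeholder)
with open("test.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "openbmb/MiniCPM-V-2_6",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "data:image/jpeg;base64," + image_b64}},
        ],
    }],
}
r = requests.post("http://localhost:5006/v1/chat/completions", json=payload)
print(r.status_code, r.text)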

@matatonic (Owner)

Currently testing, but image only so far, no video.

@matatonic (Owner)

I've updated a dev branch with the latest changes, including MiniCPM-V 2.6, microsoft/Phi-3.5-vision-instruct and fancyfeast/joy-caption-pre-alpha. I'm still testing and the :dev image is still building, so YMMV.

@saket424 (Author)

By video, they mean a collection of images (so not quite video)

ggerganov/llama.cpp#9165
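
Presumably, then, video support through the OpenAI-compatible endpoint would just mean sending several sampled frames as images in a single message. A sketch of that shape (frame files and host are placeholders):

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5006/v1", api_key="skip")

def to_data_url(path):
    # Base64-encode one frame as a data URL
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("utf-8")

frames = ["frame_0001.jpg", "frame_0002.jpg", "frame_0003.jpg"]  # pre-sampled frames
content = [{"type": "image_url", "image_url": {"url": to_data_url(p)}} for p in frames]
content.append({"type": "text", "text": "Describe the video"})

response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-2_6",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)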

@saket424 (Author)

The dev build works, thanks.
CLI_COMMAND="python vision.py -m openbmb/MiniCPM-V-2_6 --use-flash-attn --device-map cuda:0 --load-in-4bit"
anand@dell4090:~/openedai-stuff/openedai-vision$ docker compose up
[+] Running 2/1
✔ Network openedai-vision_default Created 0.1s
✔ Container openedai-vision-server-1 Created 0.0s
Attaching to server-1
server-1 | 2024-08-25 21:35:27.061 | INFO | __main__:<module>:143 - Loading VisionQnA[minicpm-v-2_6] with openbmb/MiniCPM-V-2_6
Loading checkpoint shards: 100% 4/4 [00:07<00:00, 1.75s/it]
server-1 | 2024-08-25 21:35:36.056 | INFO | vision_qna:loaded_banner:94 - Loaded openbmb/MiniCPM-V-2_6 [ device: cuda:0, dtype: torch.bfloat16, template: internal ]
server-1 | INFO: Started server process [7]
server-1 | INFO: Waiting for application startup.
server-1 | INFO: Application startup complete.
server-1 | INFO: Uvicorn running on http://0.0.0.0:5006 (Press CTRL+C to quit)
preprocessor_config.json: 100% 714/714 [00:00<00:00, 8.25MB/s]
processing_minicpmv.py: 100% 10.0k/10.0k [00:00<00:00, 63.4MB/s]
image_processing_minicpmv.py: 100% 16.6k/16.6k [00:00<00:00, 104MB/s]
server-1 | A new version of the following files was downloaded from https://huggingface.co/openbmb/MiniCPM-V-2_6:
server-1 | - image_processing_minicpmv.py
server-1 | . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
server-1 | A new version of the following files was downloaded from https://huggingface.co/openbmb/MiniCPM-V-2_6:
server-1 | - processing_minicpmv.py
server-1 | - image_processing_minicpmv.py
server-1 | . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
server-1 | /usr/local/lib/python3.11/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
server-1 | warnings.warn(
server-1 | INFO: 192.168.155.172:39888 - "POST /v1/chat/completions HTTP/1.1" 200 OK
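
A side note on the remote-code warnings above: as the message suggests, the downloaded files can be pinned with the revision argument to from_pretrained (the commit SHA below is a placeholder):

from transformers import AutoModel

# Pin the model (and its trust_remote_code files) to a fixed commit
# so new versions are never pulled silently (SHA is a placeholder).
model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-2_6",
    trust_remote_code=True,
    revision="abc1234",
)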

@matatonic (Owner)

> By video, they mean a collection of images (so not quite video)
>
> ggerganov/llama.cpp#9165

Yes, it's an image-sampling technique, but it's still not working for me: the sample code they provide fails to identify the video in my tests. Perhaps it's still my error, but it probably won't be fixed for this release.

@saket424 (Author)

There's another project I like called amblegpt (https://github.com/mhaowork/amblegpt) that has an ffmpeg frame sampler built in and is OpenAI-compatible.

We can try to use it to test this functionality.
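
For reference, that style of ffmpeg sampling can be approximated with a plain ffmpeg call, something like this sketch (paths and the 1 fps rate are illustrative):

import subprocess

# Sample roughly one frame per second from a video with ffmpeg
# (input path, output pattern, and fps value are illustrative;
# the frames/ directory must already exist).
subprocess.run(
    [
        "ffmpeg",
        "-i", "input.mp4",
        "-vf", "fps=1",
        "frames/frame_%04d.jpg",
    ],
    check=True,
)

The resulting JPEGs could then be sent as a multi-image chat request like the sketch earlier in this thread.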

matatonic changed the title from "Request to support new version MiniCPM-V 2.6 model" to "Support video in MiniCPM-V 2.6" on Aug 26, 2024
@matatonic (Owner)

Merged to main, 0.29.0 release. I will leave this ticket open until video is supported.

@saket424 (Author)

I tried this standalone Python code and it runs on my 4090 GPU:

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu  # pip install decord

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

MAX_NUM_FRAMES = 64  # if CUDA OOM, set a smaller number

def encode_video(video_path):
    def uniform_sample(l, n):
        # Pick n evenly spaced indices (one from the middle of each of n equal bins)
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # sample ~1 frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_path = "video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},  # frames precede the text prompt
]

# Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2  # use 1 if CUDA OOM and video resolution > 448*448

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
Test video: fight.mp4

python3 try.py
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 16.71it/s]
num frames: 15
/home/anand/2.6/venv2.6/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
The video begins with a news broadcast from FOX 11 at 5 PM, showing a cityscape during sunset or sunrise. It then transitions to footage of two individuals in what appears to be a school hallway near blue lockers and yellow caution lines on the floor. One individual is wearing a dark shirt and light-colored pants, while the other is in a white top and dark pants. The scene involves physical confrontation where one person is restrained by the other against the wall. The struggle continues as the person in the white top attempts to maintain control over the situation. Eventually, another individual enters, seemingly trying to mediate or intervene. The final frame features the FOX 11 logo with text "ONLY ON FOX 11," indicating exclusive content coverage.

@saket424 (Author)

@matatonic
I managed to try this still-unfinished llama.cpp PR and it works:

ggerganov/llama.cpp#9165 (comment)

matatonic added the enhancement (New feature or request) label on Sep 11, 2024