Commit

0.3.0
matatonic committed Apr 2, 2024
1 parent 37db248 commit bc23c72
Showing 11 changed files with 456 additions and 180 deletions.
35 changes: 26 additions & 9 deletions README.md
@@ -8,12 +8,21 @@ An OpenAI API compatible vision server, it functions like `gpt-4-vision-preview`
- Not affiliated with OpenAI in any way

Backend Model support:
- [X] Moondream2 [vikhyatk/moondream2](https://huggingface.co/vikhyatk/moondream2) *(only a single image and single question currently supported)
- [X] Llava [llava-hf/llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) *(mistral only for now, single image/question)
- [ ] Deepseek-VL - (in progress) [deepseek-ai/deepseek-vl-7b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat)
- [X] Moondream2 [vikhyatk/moondream2](https://huggingface.co/vikhyatk/moondream2) *(only supports a single image)
- [ ] Moondream1 [vikhyatk/moondream1](https://huggingface.co/vikhyatk/moondream1) *(broken for me)
- [X] LlavaNext [llava-v1.6-mistral-7b-hf, llava-v1.6-34b-hf (llava-v1.6-34b-hf is not working well yet)](https://huggingface.co/llava-hf) *(only supports a single image)
- [X] Llava [llava-v1.5-vicuna-7b-hf, llava-v1.5-vicuna-13b-hf, llava-v1.5-bakLlava-7b-hf](https://huggingface.co/llava-hf) *(only supports a single image)
- [ ] Deepseek-VL - [deepseek-ai/deepseek-vl-7b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat)
- [ ] ...

Version: 0.2.0
Version: 0.3.0

Recent updates:
- llava (1.5) / llavanext (1.6+) backends
- multi-turn questions & answers
- chat_with_images.py test tool
- selectable chat formats (phi15, vicuna, chatml, llama2/mistral)
- flash attention 2, accelerate, bitsandbytes (4bit, 8bit) support


API Documentation
@@ -35,7 +44,7 @@ Usage
-----

```
usage: vision.py [-h] [-m MODEL] [-b BACKEND] [--load-in-4bit] [--load-in-8bit] [--use-flash-attn] [-d DEVICE] [-P PORT] [-H HOST] [--preload]
usage: vision.py [-h] [-m MODEL] [-b BACKEND] [-f FORMAT] [--load-in-4bit] [--load-in-8bit] [--use-flash-attn] [-d DEVICE] [-P PORT] [-H HOST] [--preload]
OpenedAI Vision API Server
@@ -44,7 +53,9 @@ options:
-m MODEL, --model MODEL
The model to use, Ex. llava-hf/llava-v1.6-mistral-7b-hf (default: vikhyatk/moondream2)
-b BACKEND, --backend BACKEND
The backend to use (moondream, llava) (default: moondream)
The backend to use (moondream1, moondream2, llavanext, llava) (default: moondream2)
-f FORMAT, --format FORMAT
Force a specific chat format. (vicuna, mistral, chatml, llama2, phi15) (default: None)
--load-in-4bit load in 4bit (default: False)
--load-in-8bit load in 8bit (default: False)
--use-flash-attn Use Flash Attention 2 (default: False)
@@ -66,9 +77,15 @@ docker compose up
Sample API Usage
----------------

`test_vision.py` has a sample of how to use the API.
`chat_with_image.py` has a sample of how to use the API.

Example:
```
$ test_vision.py https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg
The image features a long wooden boardwalk running through a lush green field. The boardwalk is situated in a grassy area with trees in the background, creating a serene and picturesque scene. The sky above is filled with clouds, adding to the beauty of the landscape. The boardwalk appears to be a peaceful path for people to walk or hike along, providing a connection between the grassy field and the surrounding environment.
$ ./chat_with_image.py https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg
Answer: This is a beautiful image of a wooden path leading through a lush green field. The path appears to be well-trodden, suggesting it's a popular route for walking or hiking. The sky is a clear blue with some scattered clouds, indicating a pleasant day with good weather. The field is vibrant and seems to be well-maintained, which could suggest it's part of a park or nature reserve. The overall scene is serene and inviting, perfect for a peaceful walk in nature.
Question: Are there any animals in the picture?
Answer: No, there are no animals visible in the picture. The focus is on the path and the surrounding natural landscape.
Question:
```
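
For a programmatic call rather than the interactive script, a minimal sketch with the OpenAI Python client could look like the following; the base URL and placeholder API key mirror `chat_with_image.py` below, and the image URL is only an example:

```
from openai import OpenAI

# Point the client at the local server (same defaults used by chat_with_image.py and docker-compose).
client = OpenAI(base_url='http://localhost:5006/v1', api_key='skip')

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "What do you see in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},  # example URL
    ]}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```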
86 changes: 38 additions & 48 deletions backend/llava.py
@@ -1,58 +1,48 @@
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from transformers import LlavaProcessor, LlavaForConditionalGeneration
from vision_qna import *

# Assumes mistral prompt format!!
# model_id = "llava-hf/llava-v1.6-mistral-7b-hf"

from vision_qna import VisionQnABase
# llava-hf/bakLlava-v1-hf # llama2
# llava-hf/llava-1.5-7b-hf # vicuna
# llava-hf/llava-1.5-13b-hf # vicuna

class VisionQnA(VisionQnABase):
model_name: str = "llava"
format: str = 'vicuna'

def __init__(self, model_id: str, device: str, extra_params = {}):
self.device = self.select_device() if device == 'auto' else device

params = {
'pretrained_model_name_or_path': model_id,
'torch_dtype': torch.float32 if device == 'cpu' else torch.float16,
'low_cpu_mem_usage': True,
}
if extra_params.get('load_in_4bit', False):
load_in_4bit_params = {
'bnb_4bit_compute_dtype': torch.float32 if device == 'cpu' else torch.float16,
'load_in_4bit': True,
}
params.update(load_in_4bit_params)

if extra_params.get('load_in_8bit', False):
load_in_8bit_params = {
'load_in_8bit': True,
}
params.update(load_in_8bit_params)

# 'use_flash_attention_2': True,
if extra_params.get('use_flash_attn', False):
flash_attn_params = {
"attn_implementation": "flash_attention_2",
}
params.update(flash_attn_params)

self.processor = LlavaNextProcessor.from_pretrained(model_id)
self.model = LlavaNextForConditionalGeneration.from_pretrained(**params)
if not (extra_params.get('load_in_4bit', False) or extra_params.get('load_in_8bit', False)):
self.model.to(self.device)
def __init__(self, model_id: str, device: str, extra_params = {}, format = None):
super().__init__(model_id, device, extra_params, format)

if not format:
# guess the format based on model id
if 'mistral' in model_id.lower():
self.format = 'llama2'
elif 'bakllava' in model_id.lower():
self.format = 'llama2'
elif 'vicuna' in model_id.lower():
self.format = 'vicuna'

self.processor = LlavaProcessor.from_pretrained(model_id)
self.model = LlavaForConditionalGeneration.from_pretrained(**self.params)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def single_question(self, image_url: str, prompt: str) -> str:
image = await self.url_to_image(image_url)

# prepare image and text prompt, using the appropriate prompt template
prompt = f"[INST] <image>\n{prompt} [/INST]"
inputs = self.processor(prompt, image, return_tensors="pt").to(self.device)
async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:

images, prompt = await prompt_from_messages(messages, self.format)
inputs = self.processor(prompt, images, return_tensors="pt").to(self.device)

# autoregressively complete prompt
output = self.model.generate(**inputs, max_new_tokens=300)
output = self.model.generate(**inputs, max_new_tokens=max_tokens)
answer = self.processor.decode(output[0], skip_special_tokens=True)
id = answer.rfind('[/INST]')
return answer[id + 8:]

if self.format in ['llama2', 'mistral']:
idx = answer.rfind('[/INST]') + len('[/INST]') + 1 #+ len(images)
return answer[idx:]
elif self.format == 'vicuna':
idx = answer.rfind('ASSISTANT:') + len('ASSISTANT:') + 1 #+ len(images)
return answer[idx:]
elif self.format == 'chatml':
            idx = answer.rfind('<|im_start|>assistant\n') + len('<|im_start|>assistant\n') + 1 #+ len(images)
end_idx = answer.rfind('<|im_end|>')
return answer[idx:end_idx]

return answer
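
The answer-stripping above keys off markers that each chat template is expected to emit. As a rough, hypothetical illustration (the real templates come from `prompt_from_messages` in `vision_qna.py`, which is not part of this diff, and may differ in exact spacing):

```
# Hypothetical sketches of the chat formats referenced above; the actual
# templates are built by vision_qna.prompt_from_messages.
LLAMA2 = "[INST] <image>\n{user} [/INST] {assistant}"
VICUNA = "USER: <image>\n{user} ASSISTANT: {assistant}"
CHATML = ("<|im_start|>user\n<image>\n{user}<|im_end|>\n"
          "<|im_start|>assistant\n{assistant}<|im_end|>")
```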
50 changes: 50 additions & 0 deletions backend/llavanext.py
@@ -0,0 +1,50 @@
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
from vision_qna import *

# model_id = "llava-hf/llava-v1.6-mistral-7b-hf" # llama2
# model_id = "llava-hf/llava-v1.6-34b-hf" # chatml
# model_id = "llava-hf/llava-v1.6-vicuna-13b-hf" # vicuna
# model_id = "llava-hf/llava-v1.6-vicuna-7b-hf" # vicuna

class VisionQnA(VisionQnABase):
model_name: str = "llavanext"
format: str = 'llama2'

def __init__(self, model_id: str, device: str, extra_params = {}, format = None):
super().__init__(model_id, device, extra_params, format)

if not format:
if 'mistral' in model_id:
self.format = 'llama2'
elif 'vicuna' in model_id:
self.format = 'vicuna'
elif 'v1.6-34b' in model_id:
self.format = 'chatml'

self.processor = LlavaNextProcessor.from_pretrained(model_id)
self.model = LlavaNextForConditionalGeneration.from_pretrained(**self.params)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:

images, prompt = await prompt_from_messages(messages, self.format)
inputs = self.processor(prompt, images, return_tensors="pt").to(self.model.device)

output = self.model.generate(**inputs, max_new_tokens=max_tokens)
answer = self.processor.decode(output[0], skip_special_tokens=True)

if self.format in ['llama2', 'mistral']:
idx = answer.rfind('[/INST]') + len('[/INST]') + 1 #+ len(images)
return answer[idx:]
elif self.format == 'vicuna':
idx = answer.rfind('ASSISTANT:') + len('ASSISTANT:') + 1 #+ len(images)
return answer[idx:]
elif self.format == 'chatml':
# XXX This is broken with the 34b, extra spaces in the tokenizer
# XXX You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
idx = answer.rfind('<|im_start|>assistant\n') + len('<|im_start|>assistant\n') + 1 #+ len(images)
end_idx = answer.rfind('<|im_end|>')
return answer[idx:end_idx]

return answer
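
As a usage sketch (assuming the `backend` directory is importable as a package and using the constructor signature shown above; the model id and options are only examples):

```
# Minimal instantiation sketch; loading will download the model weights.
from backend.llavanext import VisionQnA

vqa = VisionQnA('llava-hf/llava-v1.6-34b-hf', device='auto',
                extra_params={'load_in_4bit': True, 'use_flash_attn': True},
                format='chatml')  # or None to guess the format from the model id
# Conversations are then answered via the async chat_with_images(messages, max_tokens) method.
```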
29 changes: 0 additions & 29 deletions backend/moondream.py

This file was deleted.

44 changes: 44 additions & 0 deletions backend/moondream1.py
@@ -0,0 +1,44 @@
import re
from transformers import CodeGenTokenizerFast, AutoModelForCausalLM

from vision_qna import *

class VisionQnA(VisionQnABase):
model_name: str = "moondream1"
format: str = 'phi15'

def __init__(self, model_id: str, device: str, extra_params = {}, format = None):
super().__init__(model_id, device, extra_params, format)

# not supported yet
del self.params['device_map']

self.tokenizer = CodeGenTokenizerFast.from_pretrained(model_id)
self.model = AutoModelForCausalLM.from_pretrained(**self.params, trust_remote_code=True)

# bitsandbytes already moves the model to the device, so we don't need to do it again.
if not (extra_params.get('load_in_4bit', False) or extra_params.get('load_in_8bit', False)):
self.model.to(self.device)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
images, prompt = await prompt_from_messages(messages, self.format)
encoded_images = self.model.encode_image(images[0]).to(self.device)

# XXX currently broken here...
"""
File "hf_home/modules/transformers_modules/vikhyatk/moondream1/f6e9da68e8f1b78b8f3ee10905d56826db7a5802/modeling_phi.py", line 318, in forward
padding_mask.masked_fill_(key_padding_mask, 0.0)
RuntimeError: The expanded size of the tensor (747) must match the existing size (748) at non-singleton dimension 1. Target sizes: [1, 747]. Tensor sizes: [1, 748]
"""
answer = self.model.generate(
encoded_images,
prompt,
eos_text="<END>",
tokenizer=self.tokenizer,
max_new_tokens=max_tokens,
)[0]
answer = re.sub("<$|<END$", "", answer).strip()
return answer

40 changes: 40 additions & 0 deletions backend/moondream2.py
@@ -0,0 +1,40 @@
import re
from transformers import AutoTokenizer, AutoModelForCausalLM

from vision_qna import *

class VisionQnA(VisionQnABase):
model_name: str = "moondream2"
revision: str = '2024-03-13' # 'main'
format: str = 'phi15'

def __init__(self, model_id: str, device: str, extra_params = {}, format = None):
super().__init__(model_id, device, extra_params, format)

# not supported yet
del self.params['device_map']

self.tokenizer = AutoTokenizer.from_pretrained(model_id)
self.model = AutoModelForCausalLM.from_pretrained(**self.params, trust_remote_code=True)

# # bitsandbytes already moves the model to the device, so we don't need to do it again.
if not (extra_params.get('load_in_4bit', False) or extra_params.get('load_in_8bit', False)):
self.model.to(self.device)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
images, prompt = await prompt_from_messages(messages, self.format)

encoded_images = self.model.encode_image(images).to(self.device)

answer = self.model.generate(
encoded_images,
prompt,
eos_text="<END>",
tokenizer=self.tokenizer,
max_new_tokens=max_tokens,
#**kwargs,
)[0]
answer = re.sub("<$|<END$", "", answer).strip()
return answer
47 changes: 47 additions & 0 deletions chat_with_image.py
@@ -0,0 +1,47 @@
#!/usr/bin/env python
import argparse
from datauri import DataURI
from openai import OpenAI

# Initialize argparse
parser = argparse.ArgumentParser(description='Test vision using OpenAI')
parser.add_argument('image_url', type=str, help='URL or image file to be tested')
parser.add_argument('questions', type=str, nargs='*', help='The question to ask the image')
args = parser.parse_args()

client = OpenAI(base_url='http://localhost:5006/v1', api_key='skip')

image_url = args.image_url

if not image_url.startswith('http'):
image_url = str(DataURI.from_file(image_url))

messages = [ { "role": "user", "content": [
{ "type": "text", "text": ' '.join(args.questions) },
{"type": "image_url", "image_url": { "url": image_url } }
]}]

while True:
response = client.chat.completions.create(model="gpt-4-vision-preview", messages=messages, max_tokens=512,)
print(f"Answer: {response.choices[0].message.content}\n")

image_url = None
try:
q = input("Question: ")
# if q.startswith('http'):
# image_url = q
# q = input("Question: ")
except EOFError as e:
break

messages.extend([
{ "role": "assistant", "content": [ { 'type': 'text', 'text': response.choices[0].message.content } ] },
{ "role": "user", "content": [ { 'type': 'text', 'text': q } ] }
])

# if image_url:
# messages[-1]['content'].extend([
# {"type": "image_url", "image_url": { "url": image_url } }
# ])


3 changes: 2 additions & 1 deletion docker-compose.yml
@@ -11,7 +11,8 @@ services:
- ./hf_home:/app/hf_home
ports:
- 5006:5006
command: ["python", "vision.py", "--host", "0.0.0.0", "--port", "5006", "--backend", "llava", "--model", "llava-hf/llava-v1.6-mistral-7b-hf"]
#command: ["python", "vision.py", "--host", "0.0.0.0", "--port", "5006", "--backend", "llava", "--model", "llava-hf/llava-v1.6-mistral-7b-hf", "--load-in-4bit", "--use-flash-attn"]
command: ["python", "vision.py", "--host", "0.0.0.0", "--port", "5006", "--use-flash-attn"]
runtime: nvidia
deploy:
resources: