Commit

0.3.0
matatonic committed Apr 2, 2024
1 parent 37db248 commit bc23c72
Showing 11 changed files with 456 additions and 180 deletions.
35 changes: 26 additions & 9 deletions README.md
@@ -8,12 +8,21 @@ An OpenAI API compatible vision server, it functions like `gpt-4-vision-preview`
- Not affiliated with OpenAI in any way

Backend Model support:
- [X] Moondream2 [vikhyatk/moondream2](https://huggingface.co/vikhyatk/moondream2) *(only a single image and single question currently supported)
- [X] Llava [llava-hf/llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) *(mistral only for now, single image/question)
- [ ] Deepseek-VL - (in progress) [deepseek-ai/deepseek-vl-7b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat)
- [X] Moondream2 [vikhyatk/moondream2](https://huggingface.co/vikhyatk/moondream2) *(only supports a single image)
- [ ] Moondream1 [vikhyatk/moondream1](https://huggingface.co/vikhyatk/moondream1) *(broken for me)
- [X] LlavaNext [llava-v1.6-mistral-7b-hf, llava-v1.6-34b-hf (llava-v1.6-34b-hf is not working well yet)](https://huggingface.co/llava-hf) *(only supports a single image)
- [X] Llava [llava-v1.5-vicuna-7b-hf, llava-v1.5-vicuna-13b-hf, llava-v1.5-bakLlava-7b-hf](https://huggingface.co/llava-hf) *(only supports a single image)
- [ ] Deepseek-VL - [deepseek-ai/deepseek-vl-7b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat)
- [ ] ...

Version: 0.2.0
Version: 0.3.0

Recent updates:
- llava (1.5) / llavanext (1.6+) backends
- multi-turn questions & answers
- chat_with_images.py test tool
- selectable chat formats (phi15, vicuna, chatml, llama2/mistral)
- flash attention 2, accelerate, bitsandbytes (4bit, 8bit) support


API Documentation
@@ -35,7 +44,7 @@ Usage
-----

```
usage: vision.py [-h] [-m MODEL] [-b BACKEND] [--load-in-4bit] [--load-in-8bit] [--use-flash-attn] [-d DEVICE] [-P PORT] [-H HOST] [--preload]
usage: vision.py [-h] [-m MODEL] [-b BACKEND] [-f FORMAT] [--load-in-4bit] [--load-in-8bit] [--use-flash-attn] [-d DEVICE] [-P PORT] [-H HOST] [--preload]
OpenedAI Vision API Server
@@ -44,7 +53,9 @@ options:
-m MODEL, --model MODEL
The model to use, Ex. llava-hf/llava-v1.6-mistral-7b-hf (default: vikhyatk/moondream2)
-b BACKEND, --backend BACKEND
The backend to use (moondream, llava) (default: moondream)
The backend to use (moondream1, moondream2, llavanext, llava) (default: moondream2)
-f FORMAT, --format FORMAT
Force a specific chat format. (vicuna, mistral, chatml, llama2, phi15) (default: None)
--load-in-4bit load in 4bit (default: False)
--load-in-8bit load in 8bit (default: False)
--use-flash-attn Use Flash Attention 2 (default: False)
@@ -66,9 +77,15 @@ docker compose up
Sample API Usage
----------------

`test_vision.py` has a sample of how to use the API.
`chat_with_image.py` has a sample of how to use the API.

Example:
```
$ test_vision.py https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg
The image features a long wooden boardwalk running through a lush green field. The boardwalk is situated in a grassy area with trees in the background, creating a serene and picturesque scene. The sky above is filled with clouds, adding to the beauty of the landscape. The boardwalk appears to be a peaceful path for people to walk or hike along, providing a connection between the grassy field and the surrounding environment.
$ ./chat_with_image.py https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg
Answer: This is a beautiful image of a wooden path leading through a lush green field. The path appears to be well-trodden, suggesting it's a popular route for walking or hiking. The sky is a clear blue with some scattered clouds, indicating a pleasant day with good weather. The field is vibrant and seems to be well-maintained, which could suggest it's part of a park or nature reserve. The overall scene is serene and inviting, perfect for a peaceful walk in nature.
Question: Are there any animals in the picture?
Answer: No, there are no animals visible in the picture. The focus is on the path and the surrounding natural landscape.
Question:
```
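
For a programmatic call rather than the interactive script, a minimal sketch with the OpenAI Python client could look like the following; the base URL and placeholder API key mirror `chat_with_image.py` below, and the image URL is only an example:

```
from openai import OpenAI

# Point the client at the local server (same defaults used by chat_with_image.py and docker-compose).
client = OpenAI(base_url='http://localhost:5006/v1', api_key='skip')

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "What do you see in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},  # example URL
    ]}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```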
86 changes: 38 additions & 48 deletions backend/llava.py
@@ -1,58 +1,48 @@
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from transformers import LlavaProcessor, LlavaForConditionalGeneration
from vision_qna import *

# Assumes mistral prompt format!!
# model_id = "llava-hf/llava-v1.6-mistral-7b-hf"

from vision_qna import VisionQnABase
# llava-hf/bakLlava-v1-hf # llama2
# llava-hf/llava-1.5-7b-hf # vicuna
# llava-hf/llava-1.5-13b-hf # vicuna

class VisionQnA(VisionQnABase):
model_name: str = "llava"
format: str = 'vicuna'

def __init__(self, model_id: str, device: str, extra_params = {}):
self.device = self.select_device() if device == 'auto' else device

params = {
'pretrained_model_name_or_path': model_id,
'torch_dtype': torch.float32 if device == 'cpu' else torch.float16,
'low_cpu_mem_usage': True,
}
if extra_params.get('load_in_4bit', False):
load_in_4bit_params = {
'bnb_4bit_compute_dtype': torch.float32 if device == 'cpu' else torch.float16,
'load_in_4bit': True,
}
params.update(load_in_4bit_params)

if extra_params.get('load_in_8bit', False):
load_in_8bit_params = {
'load_in_8bit': True,
}
params.update(load_in_8bit_params)

# 'use_flash_attention_2': True,
if extra_params.get('use_flash_attn', False):
flash_attn_params = {
"attn_implementation": "flash_attention_2",
}
params.update(flash_attn_params)

self.processor = LlavaNextProcessor.from_pretrained(model_id)
self.model = LlavaNextForConditionalGeneration.from_pretrained(**params)
if not (extra_params.get('load_in_4bit', False) or extra_params.get('load_in_8bit', False)):
self.model.to(self.device)
def __init__(self, model_id: str, device: str, extra_params = {}, format = None):
super().__init__(model_id, device, extra_params, format)

if not format:
# guess the format based on model id
if 'mistral' in model_id.lower():
self.format = 'llama2'
elif 'bakllava' in model_id.lower():
self.format = 'llama2'
elif 'vicuna' in model_id.lower():
self.format = 'vicuna'

self.processor = LlavaProcessor.from_pretrained(model_id)
self.model = LlavaForConditionalGeneration.from_pretrained(**self.params)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def single_question(self, image_url: str, prompt: str) -> str:
image = await self.url_to_image(image_url)

# prepare image and text prompt, using the appropriate prompt template
prompt = f"[INST] <image>\n{prompt} [/INST]"
inputs = self.processor(prompt, image, return_tensors="pt").to(self.device)
async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:

images, prompt = await prompt_from_messages(messages, self.format)
inputs = self.processor(prompt, images, return_tensors="pt").to(self.device)

# autoregressively complete prompt
output = self.model.generate(**inputs, max_new_tokens=300)
output = self.model.generate(**inputs, max_new_tokens=max_tokens)
answer = self.processor.decode(output[0], skip_special_tokens=True)
id = answer.rfind('[/INST]')
return answer[id + 8:]

if self.format in ['llama2', 'mistral']:
idx = answer.rfind('[/INST]') + len('[/INST]') + 1 #+ len(images)
return answer[idx:]
elif self.format == 'vicuna':
idx = answer.rfind('ASSISTANT:') + len('ASSISTANT:') + 1 #+ len(images)
return answer[idx:]
elif self.format == 'chatml':
            idx = answer.rfind('<|im_start|>assistant\n') + len('<|im_start|>assistant\n') + 1 #+ len(images)
end_idx = answer.rfind('<|im_end|>')
return answer[idx:end_idx]

return answer
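
The answer-stripping above keys off markers that each chat template is expected to emit. As a rough, hypothetical illustration (the real templates come from `prompt_from_messages` in `vision_qna.py`, which is not part of this diff, and may differ in exact spacing):

```
# Hypothetical sketches of the chat formats referenced above; the actual
# templates are built by vision_qna.prompt_from_messages.
LLAMA2 = "[INST] <image>\n{user} [/INST] {assistant}"
VICUNA = "USER: <image>\n{user} ASSISTANT: {assistant}"
CHATML = ("<|im_start|>user\n<image>\n{user}<|im_end|>\n"
          "<|im_start|>assistant\n{assistant}<|im_end|>")
```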
50 changes: 50 additions & 0 deletions backend/llavanext.py
@@ -0,0 +1,50 @@
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
from vision_qna import *

# model_id = "llava-hf/llava-v1.6-mistral-7b-hf" # llama2
# model_id = "llava-hf/llava-v1.6-34b-hf" # chatml
# model_id = "llava-hf/llava-v1.6-vicuna-13b-hf" # vicuna
# model_id = "llava-hf/llava-v1.6-vicuna-7b-hf" # vicuna

class VisionQnA(VisionQnABase):
model_name: str = "llavanext"
format: str = 'llama2'

def __init__(self, model_id: str, device: str, extra_params = {}, format = None):
super().__init__(model_id, device, extra_params, format)

if not format:
if 'mistral' in model_id:
self.format = 'llama2'
elif 'vicuna' in model_id:
self.format = 'vicuna'
elif 'v1.6-34b' in model_id:
self.format = 'chatml'

self.processor = LlavaNextProcessor.from_pretrained(model_id)
self.model = LlavaNextForConditionalGeneration.from_pretrained(**self.params)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:

images, prompt = await prompt_from_messages(messages, self.format)
inputs = self.processor(prompt, images, return_tensors="pt").to(self.model.device)

output = self.model.generate(**inputs, max_new_tokens=max_tokens)
answer = self.processor.decode(output[0], skip_special_tokens=True)

if self.format in ['llama2', 'mistral']:
idx = answer.rfind('[/INST]') + len('[/INST]') + 1 #+ len(images)
return answer[idx:]
elif self.format == 'vicuna':
idx = answer.rfind('ASSISTANT:') + len('ASSISTANT:') + 1 #+ len(images)
return answer[idx:]
elif self.format == 'chatml':
# XXX This is broken with the 34b, extra spaces in the tokenizer
# XXX You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
idx = answer.rfind('<|im_start|>assistant\n') + len('<|im_start|>assistant\n') + 1 #+ len(images)
end_idx = answer.rfind('<|im_end|>')
return answer[idx:end_idx]

return answer
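
As a usage sketch (assuming the `backend` directory is importable as a package and using the constructor signature shown above; the model id and options are only examples):

```
# Minimal instantiation sketch; loading will download the model weights.
from backend.llavanext import VisionQnA

vqa = VisionQnA('llava-hf/llava-v1.6-34b-hf', device='auto',
                extra_params={'load_in_4bit': True, 'use_flash_attn': True},
                format='chatml')  # or None to guess the format from the model id
# Conversations are then answered via the async chat_with_images(messages, max_tokens) method.
```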
29 changes: 0 additions & 29 deletions backend/moondream.py

This file was deleted.

44 changes: 44 additions & 0 deletions backend/moondream1.py
@@ -0,0 +1,44 @@
import re
from transformers import CodeGenTokenizerFast, AutoModelForCausalLM

from vision_qna import *

class VisionQnA(VisionQnABase):
model_name: str = "moondream1"
format: str = 'phi15'

def __init__(self, model_id: str, device: str, extra_params = {}, format = None):
super().__init__(model_id, device, extra_params, format)

# not supported yet
del self.params['device_map']

self.tokenizer = CodeGenTokenizerFast.from_pretrained(model_id)
self.model = AutoModelForCausalLM.from_pretrained(**self.params, trust_remote_code=True)

# bitsandbytes already moves the model to the device, so we don't need to do it again.
if not (extra_params.get('load_in_4bit', False) or extra_params.get('load_in_8bit', False)):
self.model.to(self.device)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
images, prompt = await prompt_from_messages(messages, self.format)
encoded_images = self.model.encode_image(images[0]).to(self.device)

# XXX currently broken here...
"""
File "hf_home/modules/transformers_modules/vikhyatk/moondream1/f6e9da68e8f1b78b8f3ee10905d56826db7a5802/modeling_phi.py", line 318, in forward
padding_mask.masked_fill_(key_padding_mask, 0.0)
RuntimeError: The expanded size of the tensor (747) must match the existing size (748) at non-singleton dimension 1. Target sizes: [1, 747]. Tensor sizes: [1, 748]
"""
answer = self.model.generate(
encoded_images,
prompt,
eos_text="<END>",
tokenizer=self.tokenizer,
max_new_tokens=max_tokens,
)[0]
answer = re.sub("<$|<END$", "", answer).strip()
return answer

40 changes: 40 additions & 0 deletions backend/moondream2.py
@@ -0,0 +1,40 @@
import re
from transformers import AutoTokenizer, AutoModelForCausalLM

from vision_qna import *

class VisionQnA(VisionQnABase):
model_name: str = "moondream2"
revision: str = '2024-03-13' # 'main'
format: str = 'phi15'

def __init__(self, model_id: str, device: str, extra_params = {}, format = None):
super().__init__(model_id, device, extra_params, format)

# not supported yet
del self.params['device_map']

self.tokenizer = AutoTokenizer.from_pretrained(model_id)
self.model = AutoModelForCausalLM.from_pretrained(**self.params, trust_remote_code=True)

# # bitsandbytes already moves the model to the device, so we don't need to do it again.
if not (extra_params.get('load_in_4bit', False) or extra_params.get('load_in_8bit', False)):
self.model.to(self.device)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
images, prompt = await prompt_from_messages(messages, self.format)

encoded_images = self.model.encode_image(images).to(self.device)

answer = self.model.generate(
encoded_images,
prompt,
eos_text="<END>",
tokenizer=self.tokenizer,
max_new_tokens=max_tokens,
#**kwargs,
)[0]
answer = re.sub("<$|<END$", "", answer).strip()
return answer
47 changes: 47 additions & 0 deletions chat_with_image.py
@@ -0,0 +1,47 @@
#!/usr/bin/env python
import argparse
from datauri import DataURI
from openai import OpenAI

# Initialize argparse
parser = argparse.ArgumentParser(description='Test vision using OpenAI')
parser.add_argument('image_url', type=str, help='URL or image file to be tested')
parser.add_argument('questions', type=str, nargs='*', help='The question to ask the image')
args = parser.parse_args()

client = OpenAI(base_url='http://localhost:5006/v1', api_key='skip')

image_url = args.image_url

if not image_url.startswith('http'):
image_url = str(DataURI.from_file(image_url))

messages = [ { "role": "user", "content": [
{ "type": "text", "text": ' '.join(args.questions) },
{"type": "image_url", "image_url": { "url": image_url } }
]}]

while True:
response = client.chat.completions.create(model="gpt-4-vision-preview", messages=messages, max_tokens=512,)
print(f"Answer: {response.choices[0].message.content}\n")

image_url = None
try:
q = input("Question: ")
# if q.startswith('http'):
# image_url = q
# q = input("Question: ")
except EOFError as e:
break

messages.extend([
{ "role": "assistant", "content": [ { 'type': 'text', 'text': response.choices[0].message.content } ] },
{ "role": "user", "content": [ { 'type': 'text', 'text': q } ] }
])

# if image_url:
# messages[-1]['content'].extend([
# {"type": "image_url", "image_url": { "url": image_url } }
# ])


3 changes: 2 additions & 1 deletion docker-compose.yml
@@ -11,7 +11,8 @@ services:
- ./hf_home:/app/hf_home
ports:
- 5006:5006
command: ["python", "vision.py", "--host", "0.0.0.0", "--port", "5006", "--backend", "llava", "--model", "llava-hf/llava-v1.6-mistral-7b-hf"]
#command: ["python", "vision.py", "--host", "0.0.0.0", "--port", "5006", "--backend", "llava", "--model", "llava-hf/llava-v1.6-mistral-7b-hf", "--load-in-4bit", "--use-flash-attn"]
command: ["python", "vision.py", "--host", "0.0.0.0", "--port", "5006", "--use-flash-attn"]
runtime: nvidia
deploy:
resources: