Pull image processing out of the LLM/infrastructure? #5667
I'm not clear on what you're asking for. Are you asking for the M.E.AI libraries to be able to infer the image type from the bytes and then normalize the image format to one the target LLM natively understands?
The LLMs are stateless. You need to provide them with the whole request each time. Some services allow you to upload images ahead of time and then refer to those images by URL / ID. And with M.E.AI, you can create an ImageContent referencing a URL, so if you want to upload your image somewhere accessible to the LLM and it supports being told about images at a URL (like OpenAI does), then you can avoid sending the image each time.
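To make that distinction concrete, here is a minimal sketch of the two ways of building an ImageContent, written against the preview-era M.E.AI shapes; the exact constructor signatures may differ between builds, so treat it as illustrative rather than authoritative:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.Extensions.AI;

// Sketch only: constructor shapes are taken from the M.E.AI previews and may
// differ in your build.

// Variant 1: inline bytes. The whole image travels with every request whose
// history includes this message.
byte[] pngBytes = File.ReadAllBytes("scan.png");
var inlineImage = new ImageContent(pngBytes, "image/png");

// Variant 2: reference by URL. Services that accept image URLs (OpenAI, for
// example) fetch the image themselves, so the bytes are not resent per turn.
var linkedImage = new ImageContent(new Uri("https://example.com/scan.png"));

// Either content can be attached to a user message.
var message = new ChatMessage(ChatRole.User, new List<AIContent> { inlineImage });
```

The URL in the second variant is of course a placeholder; it would point at wherever the image was uploaded.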
By a raw RGB buffer I meant a simple, very portable format. Similarly with audio: everyone should support PCM, but nobody should pick it as a first choice by default. But I guess there is no such image format. I'm not entirely sure exactly what I'm asking for. I know little about the standards, capabilities, behaviors, and conventions (if any) of either the server hosting the LLM or the LLM itself, or even what is typical, or whether I got true answers from the LLM's perspective or hallucinations. But I can say that I would want to pre-empt processing by the LLM server.
Well, then the question is whether M.E.AI has any idea of the capabilities of the LLM server, or of some component between it and the LLM: metadata it could request about the input it accepts, its ability to reason about that input, and whether there is any standard way of producing errors that M.E.AI could handle (without regex, that is). I think those are all the ways M.E.AI can learn anything about the LLM service. At least for Ollama there are not, afaik, any custom options or metadata that I can get to. Even if there is no metadata from the LLM server that M.E.AI could use, at the very least on the client side there is a MIME type, and I would say there should be a filename extension. Getting dimensions is also usually easy, but as soon as the bytes are in an
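For what it's worth, the purely client-side part is cheap even without a filename: the common formats announce themselves in their first bytes, and for PNG the dimensions sit at fixed offsets in the IHDR chunk. A rough helper of my own (not an M.E.AI API, just an illustration):

```csharp
using System;
using System.Buffers.Binary;

static class ImageSniffer
{
    // Guess a MIME type from the leading "magic" bytes; null if unrecognized.
    public static string? GuessMimeType(ReadOnlySpan<byte> data)
    {
        if (data.Length >= 4 && data[0] == 0x89 && data[1] == 0x50 && data[2] == 0x4E && data[3] == 0x47)
            return "image/png";                                                             // \x89PNG
        if (data.Length >= 3 && data[0] == 0xFF && data[1] == 0xD8 && data[2] == 0xFF)
            return "image/jpeg";
        if (data.Length >= 4 &&
            ((data[0] == 0x49 && data[1] == 0x49 && data[2] == 0x2A && data[3] == 0x00) ||  // "II*\0", little-endian TIFF
             (data[0] == 0x4D && data[1] == 0x4D && data[2] == 0x00 && data[3] == 0x2A)))   // "MM\0*", big-endian TIFF
            return "image/tiff";
        if (data.Length >= 12 &&
            data[0] == 'R' && data[1] == 'I' && data[2] == 'F' && data[3] == 'F' &&
            data[8] == 'W' && data[9] == 'E' && data[10] == 'B' && data[11] == 'P')
            return "image/webp";
        return null;
    }

    // PNG only: width and height are big-endian uint32s at fixed offsets in the IHDR chunk.
    public static (int Width, int Height)? GetPngDimensions(ReadOnlySpan<byte> data)
    {
        if (data.Length < 24 || GuessMimeType(data) != "image/png")
            return null;
        int width  = (int)BinaryPrimitives.ReadUInt32BigEndian(data.Slice(16, 4));
        int height = (int)BinaryPrimitives.ReadUInt32BigEndian(data.Slice(20, 4));
        return (width, height);
    }
}
```

That at least gives my own code a trustworthy media type and size to attach to the content, regardless of what the server later does with it.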
Here's a fact that suggests to me that there is some metadata somewhere that could potentially be used:
Yes, I could have guessed. Apart from sending lots of stuff over repeatedly, processing it again is a cost. Both contribute to the delays I was experiencing, though they're not as bad as I thought. The delays I was getting have at least two other causes: switching models combined with a limit of one model loaded at a time, and a silly, aggressively short default expiration of Ollama model loads. The code editor is such an attention hog, don't you agree? 5 minutes is absolutely nothing. I wasn't aware of any of that yet, either.
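In case it helps anyone else hitting the same delay: Ollama accepts a keep_alive option per request (and an OLLAMA_KEEP_ALIVE environment variable on the server) to stretch that 5-minute default. A rough sketch of passing it on a raw chat request, outside of M.E.AI; the model name and duration are just examples:

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Json;

// Rough sketch: extend how long Ollama keeps the model loaded after a call.
using var http = new HttpClient { BaseAddress = new Uri("http://localhost:11434") };

var response = await http.PostAsJsonAsync("/api/chat", new
{
    model = "llama3.2-vision",
    messages = new[] { new { role = "user", content = "hello" } },
    keep_alive = "30m",   // keep the model resident for 30 minutes (-1 = indefinitely)
    stream = false
});
response.EnsureSuccessStatusCode();
```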
First thing I try with M.E.AI, using Ollama llama3.2-vision, is to send a 16-bit TIF. Second thing I try is converting to PNG and asking about EXIF metadata, really basic stuff, like what is the resolution of the image. And where on the image is the girl? So what is the approximate coördinate?
Now that might be really stupid novice idiocy, since the LLM isn't good at that whatsoever, also considering I took a file with some huge resolution that is almost incompressible because it's 16-bit, but it suggests to me:
Why in the world would the server need any specific file format? Shouldn't it be able to receive a raw RGB buffer: a width, a height, and a (compressed/strided) frame? That way M.E.AI would offer the most efficient way of doing that, and it would be pluggable so it would support my TIF (a rough sketch of what I mean follows after these two points). TIF is a very basic format, and it comes out of my photo scanner. Others will want WebP, or AV1. And it shouldn't matter which LLM I use, should it?
Why in the world would it be smart to send that image over every time I add more history? I think it's very wasteful, even twice. Perhaps it should give me an embedding to send over, or some ID to send over on the second reference to the same data? I even think it works like that, with caching on the Ollama side, but at the very least it wastes my bandwidth to have to send the image over as fast as the user can press Enter. Sure, I know, I can protect against misuse, especially if it's me pressing Enter, but that's not the point I'm making.
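On the first point: since no such universal raw format seems to exist, the closest thing today is probably converting on the client before the bytes ever reach the chat client. A rough sketch of the kind of pluggable pre-processing step I mean, using SixLabors.ImageSharp as an assumed third-party dependency (whether its TIFF decoder handles every 16-bit variant is another question):

```csharp
using System.IO;
using SixLabors.ImageSharp;   // assumed third-party dependency, not part of M.E.AI

// Sketch: normalize whatever the scanner produced (a 16-bit TIF here) into a
// format every vision model accepts, before it ever reaches the LLM server.
byte[] tifBytes = File.ReadAllBytes("scan.tif");

using var image = Image.Load(tifBytes);   // decode the TIFF client-side
using var ms = new MemoryStream();
image.SaveAsPng(ms);                      // re-encode as PNG
byte[] pngBytes = ms.ToArray();

// pngBytes can now go into an ImageContent instead of the original TIF.
```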
Now, to be fair, it can tell me what is wrong with the image (it's old, very purple), and it does a fairly good job of suggesting what I should do to fix it. And since I am typing here: does it know anything technical about the image, like the resolution? It insists the image is 1024x720, which is certainly not true and is also not the correct aspect ratio. But I don't have a single clue who touched my image and what they did before sending it into the LLM...