Support bulk operations #23
Comments
I have hit this problem with an application I'm building (not yet with pgai). We were ingesting so much data that we ran into OpenAI's rate limits. Our solution was a batch processing job that creates OpenAI embedding batches and, on a schedule, checks whether OpenAI has processed each batch, then saves the returned embeddings into the database. I wonder if pgai could do something like this as well?
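For anyone wanting to try the same pattern, here is a minimal sketch of the submit-then-poll loop described above. It is not pgai code and not tied to any particular client library: `submit`, `fetch`, and `save` are hypothetical callables you would back with your provider's batch endpoint and your own database writes, and `fetch` is assumed to return `None` while a batch is still being processed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Embedding = List[float]


@dataclass
class PendingBatch:
    """Handle for a submitted batch and the rows it covers (hypothetical type)."""
    batch_id: str
    row_ids: List[int]


def submit_rows(
    rows: Dict[int, str],
    submit: Callable[[List[str]], str],
    batch_size: int,
) -> List[PendingBatch]:
    """Split rows into batches, submit each one, and return handles to poll later."""
    ids = sorted(rows)
    pending: List[PendingBatch] = []
    for start in range(0, len(ids), batch_size):
        chunk = ids[start:start + batch_size]
        # `submit` is an assumed callable that returns a provider batch id.
        batch_id = submit([rows[i] for i in chunk])
        pending.append(PendingBatch(batch_id, chunk))
    return pending


def poll_once(
    pending: List[PendingBatch],
    fetch: Callable[[str], Optional[List[Embedding]]],
    save: Callable[[int, Embedding], None],
) -> List[PendingBatch]:
    """One scheduled run: save finished batches, return the ones still pending."""
    still_pending: List[PendingBatch] = []
    for batch in pending:
        embeddings = fetch(batch.batch_id)  # assumed: None means "not ready yet"
        if embeddings is None:
            still_pending.append(batch)
            continue
        for row_id, embedding in zip(batch.row_ids, embeddings):
            save(row_id, embedding)
    return still_pending
```

A scheduler (cron, a background worker, etc.) would call `poll_once` repeatedly until it returns an empty list. In a real job the pending handles would be persisted in a table rather than kept in memory.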
I ran into the same problem when I tried to bulk-generate embeddings for 25 million+ rows. I cannot do it without reaching
Right now, the only way that works for me without errors is to process the rows in small batches, like this:
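As a rough illustration of the small-batch approach (this is not the commenter's original snippet), the sketch below processes texts in fixed-size chunks and backs off exponentially when throttled. `embed_batch` and the `RateLimited` exception are hypothetical stand-ins for whatever embedding call and rate-limit error your setup uses.

```python
import time
from typing import Callable, Iterator, List, Sequence


class RateLimited(Exception):
    """Hypothetical error raised by `embed_batch` when the provider throttles."""


def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_small_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[float]]:
    """Embed texts chunk by chunk, retrying a chunk with exponential backoff."""
    results: List[List[float]] = []
    for batch in chunks(texts, batch_size):
        for attempt in range(max_retries + 1):
            try:
                results.extend(embed_batch(list(batch)))
                break  # this chunk succeeded, move on to the next one
            except RateLimited:
                if attempt == max_retries:
                    raise  # give up after the final retry
                sleep(base_delay * 2 ** attempt)
    return results
```

The `sleep` parameter is injectable only so the function can be exercised without real waiting. With tens of millions of rows you would also want to write each chunk's results to the database as you go instead of accumulating them in a list.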
Have you tried setting up a vectorizer? When you run the worker, you can specify the number of batches, the concurrency, and the poll interval. Each batch is sent to OpenAI in a single request. Of course, if you try to do too much in a short period, you're bound to be rate limited. We don't currently support OpenAI's Batch API. The alternative is to configure the vectorizer so that it stays below the rate-limit threshold. This works if your ingest comes in spikes and you're fine with some delay between inserting rows and generating their embeddings.
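To pick settings that stay under the threshold, a back-of-the-envelope calculation can help. The sketch below is only an estimate under stated assumptions: every request carries one full batch, item sizes are roughly uniform, and the limits are expressed as requests per minute and tokens per minute. The function and parameter names are hypothetical and are not pgai options.

```python
def min_seconds_between_requests(
    batch_size: int,
    avg_tokens_per_item: int,
    rpm_limit: int,
    tpm_limit: int,
    concurrency: int = 1,
) -> float:
    """Smallest per-worker delay between requests that respects both limits.

    Assumes `concurrency` identical workers share one account-wide limit.
    """
    tokens_per_request = batch_size * avg_tokens_per_item
    by_requests = 60.0 / rpm_limit                     # requests-per-minute cap
    by_tokens = 60.0 * tokens_per_request / tpm_limit  # tokens-per-minute cap
    # The stricter cap wins; each worker must wait `concurrency` times longer
    # because all workers draw from the same budget.
    return concurrency * max(by_requests, by_tokens)


# Illustrative numbers only; look up the real limits for your account and model.
delay = min_seconds_between_requests(
    batch_size=50,
    avg_tokens_per_item=200,
    rpm_limit=3000,
    tpm_limit=1_000_000,
    concurrency=4,
)
```

In this example the token limit, not the request limit, is the binding constraint, which is typical when batches are large.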
I haven't set one up yet; I'm still exploring options. Is it possible either to extend the vectorizer to support adding embeddings through OpenAI's Batch API, or to add them manually via a service that I run? (The latter would then do all the checking, batch creation, etc., but would require marking a chunk as "this will be created asynchronously, please do not do anything".)
@kolaente Supporting the Batch API in the vectorizer worker Docker image could take some effort, and it's not on our current roadmap. But you can extend the vectorizer and make your own worker. Once you have something running, we can discuss integrating it into the pgai repo. When you create a vectorizer in your DB with the […] This is the query we use to fetch items from the queue: pgai/projects/pgai/pgai/vectorizer/vectorizer.py Lines 183 to 184 in 99d62f3
From a very simplistic point of view, I think this is more or less what you need to do:
There are many more pieces, which is why it's non-trivial to add this to the project right now. If you implement something like this, we'd be very interested to hear about your experience. I hope this helps you get started. Feel free to reach out if you have more questions; you can also always find us in the pgai Discord: https://discord.com/channels/1246241636019605616/1246243698111676447
@alejandrodnm Thanks! I'll look into implementing this and report back with my findings (it might take a while until I get to it, though). Would I need to fork the project and build everything from scratch to extend the vectorizer, or is there a clear path to extending it?
I've just opened a PR for this: #280
What is the most performant/efficient way to embed lots of rows? Can we build functions or procedures to make this easy? If not, can we document guidance and provide example code?