
Alternative to using in-memory collection #45

Open
carstenj-eksponent opened this issue Oct 16, 2024 · 7 comments


@carstenj-eksponent

When I load from an index with

model = RAGMultiModalModel.from_index(index_path=index_name)

then I get the following message:

You are using in-memory collection. This means every image is stored in memory.
You might want to rethink this if you have a large collection!

I am not sure what the alternative is to using in-memory collection. I browsed the source files without finding anything.

Is there a way to use a database or any other persistent storage?

Thanks,
Carsten

@bclavie
Contributor

bclavie commented Nov 4, 2024

This is poorly phrased, thank you for flagging. "In-memory collection" here doesn't refer to the embeddings (there are DB providers, such as Vespa, and byaldi itself will eventually integrate better storage mechanisms if support doesn't become mainstream quickly enough among DB providers), but to the base64-encoded images. When not using the in-memory collection, you need to store the images somewhere else yourself and use the mapping (i.e. retrieve page X of document Y) to be able to send them to an LLM. With the in-memory collection enabled (which is done at indexing time), you don't need to do so, as we save the base64 version of the images within the index. This is costly in terms of memory/storage, but also pretty convenient.
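A minimal sketch of the two modes, for illustration (paths, model name and query below are placeholders):

from byaldi import RAGMultiModalModel

RAG = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2")

# Index with the collection stored alongside the embeddings: every page
# is kept as a base64-encoded image inside the index.
RAG.index(
    input_path="docs/",                # placeholder path
    index_name="my_index",
    store_collection_with_index=True,  # this is what enables the in-memory collection
    overwrite=True,
)

results = RAG.search("How much was the Q3 revenue?", k=3)

# With the collection stored, each result carries the page image directly:
page_b64 = results[0]["base64"]
# Without it, you only get the mapping and must fetch page X of document Y
# from your own storage:
doc_id, page_num = results[0]["doc_id"], results[0]["page_num"]

Without the flag the index stays much smaller, but you're then responsible for keeping the page images (or the original documents) retrievable by doc_id/page_num.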

I'll update the doc and message to make this clearer.

@fvisconti

fvisconti commented Nov 22, 2024

Hi, love byaldi :)

Can you add a usage example for the in-memory collection?

Reading the answer above, I think I've been overcomplicating my demo. I do this:

from pathlib import Path

def get_document_pages(index_fetch_res: list):
    # Map each doc_id to the bare file name (no extension) of its source document.
    doc_id_to_stem = {
        doc_id: Path(file_name).stem
        for doc_id, file_name in RAG.get_doc_ids_to_file_names().items()
    }

    # Resolve each search result to its pre-rendered page image on disk.
    pages = []
    for res in index_fetch_res:
        page_path = (
            Path("images-financial")
            / doc_id_to_stem[res["doc_id"]]
            / f"page_{res['page_num']}.png"
        )
        pages.append(page_path)

    return pages

img_pages = get_document_pages(results)

Where results is the list of results returned by RAG.search().

Then, to pass the images to the VLM, I have:

from typing import List

from PIL import Image

def get_answer(prompt: str, images: List[str], top_1=True):
    # Open either just the top-ranked page or all retrieved pages.
    if top_1:
        imgs_data = [Image.open(images[0])]
    else:
        imgs_data = [Image.open(image) for image in images]
    # model (the Gemini client) is instantiated elsewhere
    response = model.generate_content([*imgs_data, prompt])

    return response.text

def answer_query(prompt, images, top_1=True):
    return f"Gemini Response:\n{get_answer(prompt, images, top_1=top_1)}"

As you can see, the page images for my documents are saved on the file system, and I need to go fetch those files and open them via the PIL library.

My next step is to save the embeddings in a vector DB (I haven't figured out how to do that yet), but it would also be great not to have to fetch image files from disk when they're already in memory as base64.
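A minimal sketch of that base64 route, assuming the index was created with store_collection_with_index=True so each search result carries a base64 field:

import base64
import io

from PIL import Image

results = RAG.search(prompt, k=3)

# Decode the base64 page images straight from the search results;
# no filesystem round-trip needed.
imgs_data = [
    Image.open(io.BytesIO(base64.b64decode(res["base64"])))
    for res in results
]
response = model.generate_content([*imgs_data, prompt])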

Thanks!

@harshaharod21

harshaharod21 commented Dec 1, 2024

@fvisconti Hi, I'm also working on storing the embeddings in a vector database, but right now I don't think byaldi provides any way to store them directly in a vector DB (like ChromaDB). Alternatively, we can load the embeddings from the .pt files that byaldi stores locally into a vector DB collection (see the sketch below), though direct loading into a vector DB would be better.
Otherwise I think we have to check other notebooks with integration examples for Qdrant, Vespa, Milvus, AstraDB and Weaviate for now. These are given in the colpali repo.
Have you found any way to do this?
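For anyone who wants to try the .pt route, a minimal sketch; the index layout and file names below are assumptions that may differ between byaldi versions, and Qdrant is used because its multivector support matches ColPali's per-page bag of patch vectors:

import torch
from qdrant_client import QdrantClient, models

# Hypothetical path: check your own index directory for the actual layout.
embeddings = torch.load(".byaldi/my_index/embeddings/embeddings_0.pt")

client = QdrantClient(":memory:")

# ColPali produces one (n_patches x 128) multi-vector per page, so the
# collection needs multivector support with MaxSim scoring.
client.create_collection(
    collection_name="pages",
    vectors_config=models.VectorParams(
        size=128,
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
    ),
)

client.upsert(
    collection_name="pages",
    points=[
        models.PointStruct(id=i, vector=page_emb.tolist())
        for i, page_emb in enumerate(embeddings)
    ],
)

ChromaDB is harder to use directly because it expects a single flat vector per item, while ColPali emits a multi-vector per page.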

@fvisconti

fvisconti commented Dec 20, 2024


@harshaharod21 Hi, I found this from Vespa, which seems to be exactly what we need. There's also something similar from Milvus. It's probably not straightforward to integrate with byaldi; I'll have a look during the Christmas break.

Cheers!

@Gupta-Aryaman

Gupta-Aryaman commented Jan 9, 2025

Hey @fvisconti. Were you able to integrate Vespa with byaldi? There is also this blog on integrating Qdrant with ColPali, but I want to keep using byaldi.

@fvisconti

Hey, not even tried yet; as soon as I do, I'll let you know here for sure :)

@fvisconti


About not fetching image files when they're already in memory as base64 (from my earlier comment above): that part was actually quite easy, I forgot to mention it before. I ended up doing this:

import base64

import magic  # python-magic, used to sniff the image MIME type

def __get_base64(self) -> None:
    # Build {'mime_type', 'data'} image parts straight from the search results:
    # sniff the MIME type from the decoded bytes, keep the base64 payload as-is.
    self.doc_pages = [
        {
            'mime_type': magic.from_buffer(base64.b64decode(r['base64']), mime=True),
            'data': r['base64']
        }
        for r in self.search_results
    ]

def rag_search(self, prompt):
    self.search_results = self.rag.search(prompt, k=self.config.get("search_results"))
    if self.config.get("img_path"):
        # Pages rendered to disk: resolve file paths instead.
        self.__get_document_pages(self.config.get("img_path"))
    else:
        # In-memory collection: use the base64 images carried by the results.
        self.__get_base64()

Where I essentially set up (in self.doc_pages) the image arguments for the LLM call.
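For reference, the call site then looks roughly like this (answer_query and self.model are illustrative names; it assumes the base64 branch, since the file-path branch would need PIL opening first, as in get_answer above):

def answer_query(self, prompt: str) -> str:
    # Run retrieval, then hand the prepared {'mime_type', 'data'} image
    # parts plus the prompt to the Gemini client.
    self.rag_search(prompt)
    response = self.model.generate_content([*self.doc_pages, prompt])
    return response.text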
