
Problems with vectorstores: "RuntimeError: Cannot open header file" #3

Open · kauttoj opened this issue Jul 19, 2024 · 1 comment
kauttoj commented Jul 19, 2024

Thanks for this great course! I'm encountering a weird issue when creating the RAG vectorstores. Sometimes an "index_metadata.pickle" file appears in a subfolder of a persistent vectorstore. When this file is present, the vectorstore cannot be loaded and I get a "RuntimeError: Cannot open header file" error. If I manually delete that pickle file, the issue goes away.

After lots of testing with LangChain and Chroma, this seems to occur only once the vectorstore becomes large enough; toy examples have no issues. I get this error with your RAG example "2a_rag_basics_metadata.py", which has over 13k chunks. I also get the error for the "custom" type vectorstore in your "3_rag_text_splitting_deep_dive.py" example with over 1k items, while the other four types work fine (no index file is created).

Is there any workaround or a reasonable cause for this issue? It appears to be some sort of bug in LangChain and/or Chroma...

EDIT: This appears to be a known issue with Chroma, also discussed in chroma-core/chroma#872.
The workaround is to increase the HNSW sync threshold from its default of 1000 so that the index file is never written, e.g., by adding `collection_metadata={"hnsw:sync_threshold": 20000}` when creating a vectorstore. Hopefully this helps others running the RAG example code.
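For reference, here is a minimal sketch of that workaround (assuming the LangChain Chroma wrapper and the `docs` and `persistent_directory` variables defined in the course scripts):

```python
# Minimal sketch of the sync_threshold workaround; `docs` and
# `persistent_directory` are assumed to be defined as in the course scripts.
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

db = Chroma.from_documents(
    docs,
    embeddings,
    persist_directory=persistent_directory,
    # Raise the HNSW sync threshold above the number of chunks so Chroma
    # does not write the index_metadata.pickle file that breaks loading.
    collection_metadata={"hnsw:sync_threshold": 20000},
)
```

Pick a threshold larger than the expected chunk count; for the ~13k chunks in "2a_rag_basics_metadata.py", 20000 is enough.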

@ELBEQQAL94

I have the same issue; this is the solution that works for me:

```python
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Assumes `documents` and `persistent_directory` are defined earlier in the script

# Split the document into chunks with a maximum size of 100, with overlap for better context
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=20)
docs = text_splitter.split_documents(documents)

# Display information about the split documents
print("\n--- Document Chunks Information ---")
print(f"Number of document chunks: {len(docs)}")
print(f"Sample chunk:\n{docs[0].page_content}\n")

# Create embeddings
print("\n--- Creating embeddings ---")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Initialize vector store and process documents in batches
print("\n--- Creating vector store ---")
batch_size = 166  # Maximum batch size allowed

# Initialize an empty vector store
db = Chroma(embedding_function=embeddings, persist_directory=persistent_directory)

# Process documents in batches so the index is written incrementally
for i in range(0, len(docs), batch_size):
    batch_docs = docs[i:i + batch_size]
    db.add_documents(batch_docs)

print("\n--- Finished creating vector store ---")
```
