
[Bug]: Query does not match Get on persistent DB accessed by two processes. #3792

Open
amd-tibbetso opened this issue Feb 13, 2025 · 9 comments
Labels
bug Something isn't working

Comments

@amd-tibbetso commented Feb 13, 2025

What happened?

Description

If two processes try to interact with the same persistent database, any updates from one are observable in the documents themselves (get), but not in the embedding search (query endpoint).

This is similar to #3769, but given that it is not specific to Jupyter I wanted to add separate observations.

To me, this does not seem like an unusual use case, as I encountered the bug in the following situation:
(1) Process 1: a GUI hosted on a web browser, ideally permanently running
(2) Process 2: scripts for admins to update the database (CLI was much more efficient for these DB updates).

However, after the second process runs, its changes to the database can be viewed in process 1 using client.get(), but query fails to see any added or removed documents.

Below is a simple working example of the situation. When delving into the code to discover why I could see new documents but could not query them, I also found that clearing the _instances cache would resolve the issue, although I could not pinpoint which instance was causing it. My minimal code also shows the effect of this.

# Clear Chroma's cached component instances ("identifier" comes from the client settings; see the full example below)
from chromadb.api.shared_system_client import SharedSystemClient
system = SharedSystemClient._identifier_to_system.get(identifier)
if system:
    system._instances.clear()

Minimal Example

# my system had out of date sqlite3
import sys
__import__("pysqlite3")
sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")

# sorry for having langchain in the example case.
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_community.embeddings import FakeEmbeddings
from pathlib import Path
import time

db_path = "test-db"
collection_name = "docs"


def print_search(message, client: Chroma):
    print('=====================================')
    print(message)
    docs = client.get()
    print(' - GET')
    for uuid, text in zip(docs['ids'], docs['documents']):
        print(f"     - {uuid[0:8]}...:  {text}")
    docs = client.similarity_search("Hello", k=3)
    print(' - SIMILARITY SEARCH')
    for doc in docs:
        print(f"     - {doc.id[0:8]}...:  {doc.page_content}")


def main() -> None:
    import shutil
    import subprocess, sys

    if Path(db_path).exists():
        shutil.rmtree(db_path)

    # Create the first client and add the initial document.
    client1 = Chroma(
        collection_name,
        persist_directory=db_path,
        embedding_function=FakeEmbeddings(size=10)
    )
    doc1 = Document(page_content="Hello world")
    client1.add_documents([doc1])
    
    print_search("# MAIN CLIENT Search 1", client1)

    # Launch the second client in a subprocess.
    proc = subprocess.Popen([sys.executable, __file__, "--child"])

    # Wait until the flag file is created by the second process.
    flag_file = Path(db_path) / "ready.txt"
    while not flag_file.exists():
        time.sleep(0.1)

    # Now that the flag file exists, upload a second document.
    doc2 = Document(page_content="Second hello")
    client1.add_documents([doc2])
   
    print_search("# MAIN CLIENT Search 2", client1)

    # Remove the flag file to signal the second client to continue.
    flag_file.unlink(missing_ok=True)

    # Wait for the second client process to finish.
    proc.wait()

def second_client_task() -> None:
    # Create the second client.
    client2 = Chroma(
        collection_name,
        persist_directory=db_path,
        embedding_function=FakeEmbeddings(size=10)
    )
    # Run the first query.
    print_search("# SUBPROCESS CLIENT Search 1", client2, )

    flag_file = (Path(db_path)/"ready.txt")
    flag_file.write_text('ready for second upload')
    
    while flag_file.exists():
        time.sleep(0.1)

    print_search("# SUBPROCESS CLIENT Search 2", client2, )

    # Create a third client without clearing any underlying caches
    client3 = Chroma(
        collection_name,
        persist_directory=db_path,
        embedding_function=FakeEmbeddings(size=10)
    )

    print_search("# SUBPROCESS NEW CLIENT Search 2", client3, )



    # Clear the built in caches for underlying components
    from chromadb.api.shared_system_client import SharedSystemClient
    identifier = SharedSystemClient._get_identifier_from_settings(
        client3._client_settings
    )
    system = SharedSystemClient._identifier_to_system.get(identifier)
    if system:
        system._instances.clear()

    client4 = Chroma(
        collection_name,
        persist_directory=db_path,
        embedding_function=FakeEmbeddings(size=10)
    )

    print_search("# SUBPROCESS FINAL CLIENT Search 2", client4, )

if __name__ == "__main__":
    if "--child" in sys.argv:
        second_client_task()
    else:
        main()

Code output

=====================================
# MAIN CLIENT Search 1
 - GET
     - 2783b001...:  Hello world
Number of requested results 3 is greater than number of elements in index 1, updating n_results = 1
 - SIMILARITY SEARCH
     - 2783b001...:  Hello world
=====================================
# SUBPROCESS CLIENT Search 1
 - GET
     - 2783b001...:  Hello world
Number of requested results 3 is greater than number of elements in index 1, updating n_results = 1
 - SIMILARITY SEARCH
     - 2783b001...:  Hello world
=====================================
# MAIN CLIENT Search 2
 - GET
     - 2783b001...:  Hello world
     - 3b7d0533...:  Second hello
Number of requested results 3 is greater than number of elements in index 2, updating n_results = 2
 - SIMILARITY SEARCH
     - 2783b001...:  Hello world
     - 3b7d0533...:  Second hello
=====================================
# SUBPROCESS CLIENT Search 2
 - GET
     - 2783b001...:  Hello world
     - 3b7d0533...:  Second hello
Number of requested results 3 is greater than number of elements in index 1, updating n_results = 1
 - SIMILARITY SEARCH
     - 2783b001...:  Hello world
=====================================
# SUBPROCESS NEW CLIENT Search 2
 - GET
     - 2783b001...:  Hello world
     - 3b7d0533...:  Second hello
Number of requested results 3 is greater than number of elements in index 1, updating n_results = 1
 - SIMILARITY SEARCH
     - 2783b001...:  Hello world
=====================================
# SUBPROCESS FINAL CLIENT Search 2
 - GET
     - 2783b001...:  Hello world
     - 3b7d0533...:  Second hello
Number of requested results 3 is greater than number of elements in index 2, updating n_results = 2
 - SIMILARITY SEARCH
     - 2783b001...:  Hello world
     - 3b7d0533...:  Second hello

Versions

chromadb==0.6.3
langchain==0.3.18
langchain-chroma==0.2.1
langchain-community==0.3.17
langchain-core==0.3.34
pysqlite3==0.5.4
pysqlite3-binary==0.5.4

Relevant log output

amd-tibbetso added the bug label on Feb 13, 2025
@tazarov (Contributor) commented Feb 16, 2025

Hey @amd-tibbetso, Chroma has various internal persistence mechanisms to ensure your data is safely stored. However, the way that single-node Chroma works does not lend itself to multi-process access.

The workflow for adding data in Chroma looks (at a high level) like this:

[Diagram: high-level overview of the Chroma write path]

Of the above components, the WAL and the HNSW index hold in-memory state that is not shared or visible across processes.

Here is a sequence diagram to roughly explain what data each process sees. Note that things vary depending on when each process was started, as some data/state is loaded at startup. The diagram assumes both processes are started at the same time:

[Sequence diagram: data visible to each of the two processes]

The bottom line is that the current single-node persistence model is not process-safe, and multi-process access can lead to corruption and data loss.
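
To make that concrete, here is a minimal sketch using plain chromadb (no langchain), assuming the "docs" collection created by the reproduction script above so that the 10-dimensional query embedding matches; which store each call reads from follows the WAL/HNSW split described above:

import chromadb

# Each process builds its own in-memory state (WAL tail, HNSW index) when the
# client is created; only the data persisted on disk is shared between processes.
client = chromadb.PersistentClient(path="test-db")
collection = client.get_or_create_collection("docs")

# get() returns documents/metadata from the persistent store, so it reflects
# writes made by the other process.
print(collection.get())

# query() is answered from this process's in-memory HNSW segment, which is not
# refreshed when another process writes, so newly added vectors are missing.
print(collection.query(query_embeddings=[[0.0] * 10], n_results=3))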

@amd-tibbetso (Author)

I understand that concurrent modifications from two processes are not safe, and that in that sense this behavior is not a bug. However, as in my example, "Chroma2" is not modifying anything, only querying the vector DB.

However, it still seems extremely misleading (without access to this issue and the great diagram above) that get shows the updates from "Chroma1", while a query from "Chroma2", which has made no DB modifications of its own (and thus has nothing that could override the changes from Chroma1), still cannot see those vector updates even after the process running Chroma1 has completed.

On top of this disconnect, the fact that, within process 2, discarding "Chroma2" and creating a new client does not pick up those updates without clearing deeply obscure object caches makes it very difficult to work around the multi-process limitation by closing and reopening the client to the persistent directory (which, to me, seems a very intuitive thing to try for the read-only process). This means that even for a process doing read-only work like "Chroma2" above, there appears to be no documented way to easily "reset" a connection to view updates (even after Chroma1 has closed), because opening a new client reuses the cached HNSW objects.

At minimum, this seems worth addressing with documentation stating that it is impossible to view vector updates from any process started before the "update process" completes. And if it is not actually "impossible", then with documentation on a recommended way to safely clear all the caches and reset a connection within a process that only needs up-to-date read access.
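
For reference, the kind of "reset" helper I have in mind would look roughly like the sketch below; it only repackages the cache-clearing from my minimal example (so the same private-API caveats apply), and refresh_client is just a hypothetical name:

from chromadb.api.shared_system_client import SharedSystemClient
from langchain_chroma import Chroma
from langchain_community.embeddings import FakeEmbeddings


def refresh_client(old_client: Chroma, collection_name: str, db_path: str) -> Chroma:
    """Drop Chroma's cached component instances and reopen the collection,
    forcing the new client to rebuild its in-memory state (e.g. the HNSW index)
    from the persist directory instead of reusing cached objects."""
    identifier = SharedSystemClient._get_identifier_from_settings(
        old_client._client_settings
    )
    system = SharedSystemClient._identifier_to_system.get(identifier)
    if system:
        system._instances.clear()
    return Chroma(
        collection_name,
        persist_directory=db_path,
        embedding_function=FakeEmbeddings(size=10),
    )


# usage in the read-only process:
# client2 = refresh_client(client2, collection_name, db_path)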

@HammadB (Collaborator) commented Feb 16, 2025

Hi - local Chroma is not process-safe. We don't intend to support usage across processes in the near term and advise against attempting to use it this way.

There is in-memory state that will not be reflected in the second process.

May I ask what you are trying to achieve?

@amd-tibbetso (Author)

As I mentioned at the beginning, the intent is what seems like a very innocuous use case.

(1) Process 1: a GUI hosted on a web browser, ideally permanently running (given the multi-process limitation, the GUI will have read-only database access).
(2) Process 2: scripts for admins to update the database with improved/new documents. The CLI scripts are more convenient and more automated than managing an admin portal on the GUI.

If process 1 never modifies the DB, it seems entirely reasonable for it to check for DB updates. It can already see that database entries were added (and retrieve their texts with get), but it can never perform a query that includes the new vectors.

@HammadB (Collaborator) commented Feb 16, 2025

Got it.

This isn't something we want to support explicitly: without cross-process safeguards that prevent unknowing users from accidentally corrupting their data, we would not want to officially endorse these sorts of flows.

I can likely suggest a workaround that does what you want.

I can try to put together an example tomorrow evening for you.

@amd-tibbetso (Author)

If that's the case, which is entirely fair, then I do have this workaround for now (also already in my minimal code above).

# Clear the built-in caches for underlying components
from chromadb.api.shared_system_client import SharedSystemClient

identifier = SharedSystemClient._get_identifier_from_settings(
    client3._client_settings
)
system = SharedSystemClient._identifier_to_system.get(identifier)
if system:
    system._instances.clear()

If you think this approach is inherently unsafe compared to what you have in mind, please suggest an alternative. Otherwise, I can make do.

Thank you very much :)

@tazarov (Contributor) commented Feb 17, 2025

@amd-tibbetso, just out of curiosity, are you not able to run a local server? You could then use chromadb.HttpClient() to connect to it.

@amd-tibbetso (Author)

> @amd-tibbetso, just out of curiosity, are you not able to run a local server? You could then use chromadb.HttpClient() to connect to it.

I don't see this as feasible, as our use case doesn't generally assume a permanent server being available - but let me ask further to see if the benefits are worth it.

Process 0: start the Chroma server in parallel to process 1, from the persistent directory.
Process 1: our GUI, relatively long-running. Switches over to HttpClient, generally read-only.
Process 2: our CLI for big batch updates, which also switches to HttpClient and sends new vectors via that client instead. Are you saying those new vectors would be available in process 1 for retrieval?

@tazarov (Contributor) commented Feb 22, 2025

@amd-tibbetso, indeed, if process 0 is a server, then all updates and queries (get() or query()) will see the same state.

There are many ways to run a local server:

  • The Chroma CLI: chroma run
  • Docker / Docker Compose
  • Kubernetes (K8s)

I even did an experiment where you can run a local server via Unix domain sockets.

Running a server gives you the benefit of scale, and if your use case and needs for scale require it, you can move outside the confines of a single host, e.g. Chroma Cloud :)
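
To make that concrete, a rough sketch of the server-based setup for the GUI/CLI split above might look like this; the host/port, the chroma run flags, and passing a prebuilt client into the langchain wrapper are assumptions rather than a tested recipe:

# Process 0 (run once, e.g. in a terminal or a service unit), roughly:
#   chroma run --path test-db --port 8000
#
# Processes 1 and 2 then talk to that server instead of opening the persist
# directory directly, so both see the same state.
import chromadb
from langchain_chroma import Chroma
from langchain_community.embeddings import FakeEmbeddings

http_client = chromadb.HttpClient(host="localhost", port=8000)

docs = Chroma(
    collection_name="docs",
    client=http_client,  # assumes langchain_chroma accepts a prebuilt chromadb client
    embedding_function=FakeEmbeddings(size=10),
)

# Writes from the CLI process (process 2), e.g. docs.add_documents([...]),
# become visible to get() and similarity_search() in the GUI process
# (process 1), since both go through the single server.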
