
[Bug]: Query does not match Get on persistent DB accessed by two processes. #3792

Open
amd-tibbetso opened this issue Feb 13, 2025 · 9 comments
Labels
bug Something isn't working

Comments

@amd-tibbetso commented Feb 13, 2025

What happened?

Description

If two processes try to interact with the same persistent database, any updates from one are observable in the documents themselves (get), but not in the embedding search (query endpoint).

This is similar to #3769, but given that it is not specific to Jupyter I wanted to add separate observations.

To me, this does not seem like an unusual use case, as I encountered the bug in the following situation:
(1) Process 1: a GUI hosted on a web browser, ideally permanently running
(2) Process 2: scripts for admins to update the database (CLI was much more efficient for these DB updates).

However, after the second process runs, its changes to the database can be viewed in process 1 using client.get(), but query fails to see any added or removed documents.

Below is a simple working example of the situation. When delving into the code to discover why I could see new documents but could not query them, I also found that clearing the _instances cache would resolve the issue, although I could not pinpoint which instance was causing it. My minimal code also shows the effect of this.

# Clear Chroma's cached component instances ("identifier" comes from the client settings; see the full example below)
from chromadb.api.shared_system_client import SharedSystemClient
system = SharedSystemClient._identifier_to_system.get(identifier)
if system:
    system._instances.clear()

Minimal Example

# my system had out of date sqlite3
import sys
__import__("pysqlite3")
sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")

# sorry for having langchain in the example case.
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_community.embeddings import FakeEmbeddings
from pathlib import Path
import time

db_path = "test-db"
collection_name = "docs"


def print_search(message, client: Chroma):
    print('=====================================')
    print(message)
    docs = client.get()
    print(' - GET')
    for uuid, text in zip(docs['ids'], docs['documents']):
        print(f"     - {uuid[0:8]}...:  {text}")
    docs = client.similarity_search("Hello", k=3)
    print(' - SIMILARITY SEARCH')
    for doc in docs:
        print(f"     - {doc.id[0:8]}...:  {doc.page_content}")


def main() -> None:
    import shutil
    import subprocess, sys

    if Path(db_path).exists():
        shutil.rmtree(db_path)

    # Create the first client and add the initial document.
    client1 = Chroma(
        collection_name,
        persist_directory=db_path,
        embedding_function=FakeEmbeddings(size=10)
    )
    doc1 = Document(page_content="Hello world")
    client1.add_documents([doc1])
    
    print_search("# MAIN CLIENT Search 1", client1)

    # Launch the second client in a subprocess.
    proc = subprocess.Popen([sys.executable, __file__, "--child"])

    # Wait until the flag file is created by the second process.
    flag_file = Path(db_path) / "ready.txt"
    while not flag_file.exists():
        time.sleep(0.1)

    # Now that the flag file exists, upload a second document.
    doc2 = Document(page_content="Second hello")
    client1.add_documents([doc2])
   
    print_search("# MAIN CLIENT Search 2", client1)

    # Remove the flag file to signal the second client to continue.
    flag_file.unlink(missing_ok=True)

    # Wait for the second client process to finish.
    proc.wait()

def second_client_task() -> None:
    # Create the second client.
    client2 = Chroma(
        collection_name,
        persist_directory=db_path,
        embedding_function=FakeEmbeddings(size=10)
    )
    # Run the first query.
    print_search("# SUBPROCESS CLIENT Search 1", client2, )

    flag_file = (Path(db_path)/"ready.txt")
    flag_file.write_text('ready for second upload')
    
    while flag_file.exists():
        time.sleep(0.1)

    print_search("# SUBPROCESS CLIENT Search 2", client2, )

    # Create a third client without clearing any underlying caches
    client3 = Chroma(
        collection_name,
        persist_directory=db_path,
        embedding_function=FakeEmbeddings(size=10)
    )

    print_search("# SUBPROCESS NEW CLIENT Search 2", client3, )



    # Clear the built in caches for underlying components
    from chromadb.api.shared_system_client import SharedSystemClient
    identifier = SharedSystemClient._get_identifier_from_settings(
        client3._client_settings
    )
    system = SharedSystemClient._identifier_to_system.get(identifier)
    if system:
        system._instances.clear()

    client4 = Chroma(
        collection_name,
        persist_directory=db_path,
        embedding_function=FakeEmbeddings(size=10)
    )

    print_search("# SUBPROCESS FINAL CLIENT Search 2", client4, )

if __name__ == "__main__":
    if "--child" in sys.argv:
        second_client_task()
    else:
        main()

Code output

=====================================
# MAIN CLIENT Search 1
 - GET
     - 2783b001...:  Hello world
Number of requested results 3 is greater than number of elements in index 1, updating n_results = 1
 - SIMILARITY SEARCH
     - 2783b001...:  Hello world
=====================================
# SUBPROCESS CLIENT Search 1
 - GET
     - 2783b001...:  Hello world
Number of requested results 3 is greater than number of elements in index 1, updating n_results = 1
 - SIMILARITY SEARCH
     - 2783b001...:  Hello world
=====================================
# MAIN CLIENT Search 2
 - GET
     - 2783b001...:  Hello world
     - 3b7d0533...:  Second hello
Number of requested results 3 is greater than number of elements in index 2, updating n_results = 2
 - SIMILARITY SEARCH
     - 2783b001...:  Hello world
     - 3b7d0533...:  Second hello
=====================================
# SUBPROCESS CLIENT Search 2
 - GET
     - 2783b001...:  Hello world
     - 3b7d0533...:  Second hello
Number of requested results 3 is greater than number of elements in index 1, updating n_results = 1
 - SIMILARITY SEARCH
     - 2783b001...:  Hello world
=====================================
# SUBPROCESS NEW CLIENT Search 2
 - GET
     - 2783b001...:  Hello world
     - 3b7d0533...:  Second hello
Number of requested results 3 is greater than number of elements in index 1, updating n_results = 1
 - SIMILARITY SEARCH
     - 2783b001...:  Hello world
=====================================
# SUBPROCESS FINAL CLIENT Search 2
 - GET
     - 2783b001...:  Hello world
     - 3b7d0533...:  Second hello
Number of requested results 3 is greater than number of elements in index 2, updating n_results = 2
 - SIMILARITY SEARCH
     - 2783b001...:  Hello world
     - 3b7d0533...:  Second hello

Versions

chromadb==0.6.3
langchain==0.3.18
langchain-chroma==0.2.1
langchain-community==0.3.17
langchain-core==0.3.34
pysqlite3==0.5.4
pysqlite3-binary==0.5.4

Relevant log output

amd-tibbetso added the bug label on Feb 13, 2025
@tazarov (Contributor) commented Feb 16, 2025

Hey @amd-tibbetso, Chroma has various internal persistence mechanisms to ensure your data is safely stored. However, the way that single-node Chroma works does not lend itself to multi-process access.

The workflow for adding data in Chroma looks (at a high level) like this:

[Diagram: high-level overview of the Chroma write path]

Of the above components, the WAL and the HNSW index hold in-memory state that is not shared or visible across processes.

Here is a sequence diagram to roughly explain what data each process sees. Note that things vary depending on when each process was started, as some data/state is loaded at startup. The diagram assumes both processes are started at the same time:

[Sequence diagram: data visible to each of the two processes]

The bottom line is that the current single-node persistence model is not process-safe, and multi-process access can lead to corruption and data loss.
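
To make that concrete, here is a minimal sketch using plain chromadb (no langchain), assuming the "docs" collection created by the reproduction script above so that the 10-dimensional query embedding matches; which store each call reads from follows the WAL/HNSW split described above:

import chromadb

# Each process builds its own in-memory state (WAL tail, HNSW index) when the
# client is created; only the data persisted on disk is shared between processes.
client = chromadb.PersistentClient(path="test-db")
collection = client.get_or_create_collection("docs")

# get() returns documents/metadata from the persistent store, so it reflects
# writes made by the other process.
print(collection.get())

# query() is answered from this process's in-memory HNSW segment, which is not
# refreshed when another process writes, so newly added vectors are missing.
print(collection.query(query_embeddings=[[0.0] * 10], n_results=3))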

@amd-tibbetso (Author)

I understand that concurrent modifications from two processes are not safe, and that in that sense this behavior is not a bug. However, as in my example, "Chroma2" is not modifying anything, only querying the vector DB.

However, it still seems extremely misleading (without access to this issue and the great diagram above) that get shows the updates from "Chroma1", while a query from "Chroma2", which has made no DB modifications of its own (and thus has nothing that could override the changes from Chroma1), still cannot see those vector updates even after the process running Chroma1 has completed.

On top of this disconnect, the fact that, within process 2, discarding "Chroma2" and creating a new client does not pick up those updates without clearing deeply obscure object caches makes it very difficult to work around the multi-process limitation by closing and reopening the client to the persistent directory (which, to me, seems a very intuitive thing to try for the read-only process). This means that even for a process doing read-only work like "Chroma2" above, there appears to be no documented way to easily "reset" a connection to view updates (even after Chroma1 has closed), because opening a new client reuses the cached HNSW objects.

At minimum, this seems worth addressing with documentation stating that it is impossible to view vector updates from any process started before the "update process" completes. And if it is not actually "impossible", then with documentation on a recommended way to safely clear all the caches and reset a connection within a process that only needs up-to-date read access.
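
For reference, the kind of "reset" helper I have in mind would look roughly like the sketch below; it only repackages the cache-clearing from my minimal example (so the same private-API caveats apply), and refresh_client is just a hypothetical name:

from chromadb.api.shared_system_client import SharedSystemClient
from langchain_chroma import Chroma
from langchain_community.embeddings import FakeEmbeddings


def refresh_client(old_client: Chroma, collection_name: str, db_path: str) -> Chroma:
    """Drop Chroma's cached component instances and reopen the collection,
    forcing the new client to rebuild its in-memory state (e.g. the HNSW index)
    from the persist directory instead of reusing cached objects."""
    identifier = SharedSystemClient._get_identifier_from_settings(
        old_client._client_settings
    )
    system = SharedSystemClient._identifier_to_system.get(identifier)
    if system:
        system._instances.clear()
    return Chroma(
        collection_name,
        persist_directory=db_path,
        embedding_function=FakeEmbeddings(size=10),
    )


# usage in the read-only process:
# client2 = refresh_client(client2, collection_name, db_path)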

@HammadB (Collaborator) commented Feb 16, 2025

Hi - local Chroma is not process-safe. We don't intend to support usage across processes in the near term and advise against attempting to use it this way.

There is in-memory state that will not be reflected in the second process.

May I ask what you are trying to achieve?

@amd-tibbetso (Author)

As I mentioned at the beginning, the intent is what seems like a very innocuous use case.

(1) Process 1: a GUI hosted on a web browser, ideally permanently running (given the multi-process limitation, the GUI will have read-only database access).
(2) Process 2: scripts for admins to update the database with improved/new documents. The CLI scripts are more convenient and more automated than managing an admin portal on the GUI.

If process 1 never modifies the DB, it seems entirely reasonable for it to check for DB updates. It can already see that database entries were added (and retrieve their texts with get), but it can never perform a query that includes the new vectors.

@HammadB (Collaborator) commented Feb 16, 2025

Got it.

This isn't something we want to support explicitly: without cross-process safeguards that prevent unknowing users from accidentally corrupting their data, we would not want to officially endorse these sorts of flows.

I can likely suggest a workaround that does what you want.

I can try to put together an example tomorrow evening for you.

@amd-tibbetso (Author)

If that's the case, which is entirely fair, then I do have this workaround for now (also already in my minimal code above).

# Clear the built-in caches for underlying components
from chromadb.api.shared_system_client import SharedSystemClient

identifier = SharedSystemClient._get_identifier_from_settings(
    client3._client_settings
)
system = SharedSystemClient._identifier_to_system.get(identifier)
if system:
    system._instances.clear()

If you think this approach is inherently unsafe compared to what you have in mind, please suggest an alternative. Otherwise, I can make do.

Thank you very much :)

@tazarov (Contributor) commented Feb 17, 2025

@amd-tibbetso, just out of curiosity, are you not able to run a local server? You could then use chromadb.HttpClient() to connect to it.

@amd-tibbetso (Author)

> @amd-tibbetso, just out of curiosity, are you not able to run a local server? You could then use chromadb.HttpClient() to connect to it.

I don't see this as feasible, as our use case doesn't generally assume a permanent server being available - but let me ask further to see if the benefits are worth it.

Process 0: start the Chroma server in parallel to process 1, from the persistent directory.
Process 1: our GUI, relatively long-running. Switches over to HttpClient, generally read-only.
Process 2: our CLI for big batch updates, which also switches to HttpClient and sends new vectors via that client instead. Are you saying those new vectors would be available in process 1 for retrieval?

@tazarov (Contributor) commented Feb 22, 2025

@amd-tibbetso, indeed, if process 0 is a server, then all updates and queries (get() or query()) will see the same state.

There are many ways to run a local server:

  • The Chroma CLI: chroma run
  • Docker / Docker Compose
  • Kubernetes (K8s)

I even did an experiment where you can run a local server via Unix domain sockets.

Running a server gives you the benefit of scale, and if your use case and needs for scale require it, you can move outside the confines of a single host, e.g. Chroma Cloud :)
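
To make that concrete, a rough sketch of the server-based setup for the GUI/CLI split above might look like this; the host/port, the chroma run flags, and passing a prebuilt client into the langchain wrapper are assumptions rather than a tested recipe:

# Process 0 (run once, e.g. in a terminal or a service unit), roughly:
#   chroma run --path test-db --port 8000
#
# Processes 1 and 2 then talk to that server instead of opening the persist
# directory directly, so both see the same state.
import chromadb
from langchain_chroma import Chroma
from langchain_community.embeddings import FakeEmbeddings

http_client = chromadb.HttpClient(host="localhost", port=8000)

docs = Chroma(
    collection_name="docs",
    client=http_client,  # assumes langchain_chroma accepts a prebuilt chromadb client
    embedding_function=FakeEmbeddings(size=10),
)

# Writes from the CLI process (process 2), e.g. docs.add_documents([...]),
# become visible to get() and similarity_search() in the GUI process
# (process 1), since both go through the single server.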
