[Bug]: Query does not match Get on persistent DB accessed by two processes. #3792
Comments
Hey @amd-tibbetso, Chroma has various internal persistence mechanisms to ensure your data is safely stored. However, the way single-node Chroma works does not lend itself to multi-process access. The workflow for adding data in Chroma looks (at a high level) like this:
Out of the above components, WAL and HNSW hold in-memory state which is not shared or visible across processes. Here is a sequence diagram to roughly explain what data each process sees. Note that things vary depending on when each process was started, as some data/state is loaded at startup. The diagram assumes both processes are started at the same time:
Bottom line: the current single-node persistence model is not process safe and can lead to corruption and data loss. |
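To make the visibility rules above concrete, here is a toy model in plain Python. It is not Chroma's implementation: `ToyStore`, its SQLite table, and the set used as an "index" are all hypothetical stand-ins. It only assumes what the comment states, namely that durable data on disk is shared while the in-memory index is loaded at startup and is private to each process.

```python
import sqlite3
import tempfile
from pathlib import Path


class ToyStore:
    """Toy model: durable rows on disk plus a private in-memory index.

    NOT Chroma's code; it only mimics "disk is shared, memory is not".
    """

    def __init__(self, path: Path) -> None:
        self._db = sqlite3.connect(str(path))
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS docs (id TEXT PRIMARY KEY, text TEXT)"
        )
        self._db.commit()
        # The "index" is built once at startup and afterwards only updated
        # by writes that go through this same object.
        self._index = {row[0] for row in self._db.execute("SELECT id FROM docs")}

    def add(self, doc_id: str, text: str) -> None:
        self._db.execute("INSERT INTO docs VALUES (?, ?)", (doc_id, text))
        self._db.commit()
        self._index.add(doc_id)

    def get(self, doc_id: str):
        # Reads go to disk, so they see rows committed by other instances.
        row = self._db.execute(
            "SELECT text FROM docs WHERE id = ?", (doc_id,)
        ).fetchone()
        return row[0] if row else None

    def query(self, word: str) -> list:
        # Searches only consult the in-memory index, which may be stale.
        hits = []
        for doc_id in sorted(self._index):
            text = self.get(doc_id)
            if text and word in text:
                hits.append(doc_id)
        return hits

    def close(self) -> None:
        self._db.close()


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "toy.sqlite3"
        reader = ToyStore(path)  # stands in for the long-running process
        writer = ToyStore(path)  # stands in for the process that adds data
        writer.add("a", "hello world")
        result = {
            "reader_get": reader.get("a"),
            "reader_query": reader.query("hello"),
            "writer_query": writer.query("hello"),
        }
        reader.close()
        writer.close()
        return result
```

In this sketch the reader's `get` finds the new row while its `query` does not, which is the same shape as the mismatch reported in this issue. The two instances live in one process here only to keep the example short.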
I understand that two modification accesses are not safe in two threads, and that this behavior in that sense is not a bug. In my example, though, "Chroma2" is not modifying anything, only querying the vector DB. It still seems extremely misleading (without access to this issue and the great diagram above) for
On top of this disconnect, within process 2, removing "Chroma2" and then starting a new one does not read in those updates unless deeply obscure object caches are cleared. That makes it very difficult to work around the multi-process limitation by closing and reopening the client to the persistent directory (which to me seems a very intuitive thing to try for the read-only process). So even for a process doing read-only actions like "Chroma2" above, there seems to be no documented way to easily "reset" a connection to view updates (even after Chroma1 has closed), because opening a new client re-uses the cached HNSW objects.
At minimum, this seems like it should be addressed with documentation saying that it is impossible to view vector updates from any process started before the "update process" completes. And, if not "impossible", then documentation on a recommended way to safely clear all the caches and reset a connection within a process that just needs up-to-date read access. |
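The reopening problem described above can be pictured with another small toy, again in plain Python and again not Chroma's code. `DISK`, `_INDEX_CACHE`, `ToyIndex`, and `ToyClient` are hypothetical names; the only assumption taken from the comment is that a process-wide cache hands the already-loaded index object to every new client.

```python
from typing import Dict, List, Tuple

# Stand-in for files on disk: persist path -> document ids (hypothetical).
DISK: Dict[str, List[str]] = {}

# Process-wide cache of loaded index objects, keyed by persist path.
_INDEX_CACHE: Dict[str, "ToyIndex"] = {}


class ToyIndex:
    def __init__(self, path: str) -> None:
        # Snapshot taken once, at load time.
        self.ids = list(DISK.get(path, []))


class ToyClient:
    def __init__(self, path: str) -> None:
        # A new client reuses the cached index if one exists for this path.
        if path not in _INDEX_CACHE:
            _INDEX_CACHE[path] = ToyIndex(path)
        self._index = _INDEX_CACHE[path]

    def query_ids(self) -> List[str]:
        return list(self._index.ids)


def demo() -> Tuple[List[str], List[str]]:
    DISK.clear()
    _INDEX_CACHE.clear()
    DISK["/data"] = ["a"]
    first = ToyClient("/data")
    DISK["/data"] = ["a", "b"]  # another process writes to disk
    del first                   # "closing" the client does not touch the cache
    stale = ToyClient("/data").query_ids()
    _INDEX_CACHE.clear()        # dropping the cached index forces a reload
    fresh = ToyClient("/data").query_ids()
    return stale, fresh
```

Here the reopened client still returns the old snapshot, and only a client created after the cache is emptied sees the new entry.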
Hi - local Chroma is not process safe. We don't intend to support usage across processes in the near term and suggest against attempting to use it this way. There is in-memory state that will not be reflected in the second process. May I ask what you are trying to achieve? |
As I mentioned at the beginning, the intent is what seems like a very innocuous use-case. (1) Process 1: a GUI hosted on a web browser, ideally permanently running. (given the multi-process unsafe limitation, the GUI will be read-only database access). If process 1 never modifies the DB, it seems extremely reasonable to check for DB updates. It can already see that database entries are added (and retrieve their texts with |
Got it. This isn't something we want to support officially: without cross-process safeguards, unknowing users could accidentally corrupt their data. That said, I can likely suggest a workaround that does what you want. I can try to put together an example tomorrow evening for you. |
If that's the case, which is entirely fair, then I do have this workaround at this time - also already in my minimal code above.
# Clear the built-in caches for underlying components
from chromadb.api.shared_system_client import SharedSystemClient
identifier = SharedSystemClient._get_identifier_from_settings(
    client3._client_settings
)
system = SharedSystemClient._identifier_to_system.get(identifier)
if system:
    system._instances.clear()
If you think this way is inherently unsafe compared to something you have in mind, please advise alternatively. Otherwise, I can make do. Thank you very much :) |
@amd-tibbetso, just out of curiosity, are you not able to run a local server? Then use |
I don't see this as feasible, as our use case doesn't generally assume a permanent server being available - but let me ask further to see if the benefits are worth it. Process 0: start the chroma server in parallel to process 1, from the persistent directory |
@amd-tibbetso, indeed if process 0 is a server then all updates or queries ( There are many ways to run a local server:
I even did an experiment where you can run a local server via unix domain sockets. Running a server gives you the benefit of scale and if your use case and needs for scale require it you can move outside of the confines of a single host. E.g. Chroma cloud :) |
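A minimal sketch of the server-based setup discussed above, assuming a Chroma server has already been started separately (for example with `chroma run --path <persist_dir>`) and is reachable on `localhost:8000`. The helper names `open_client`, `add_documents`, and `search`, as well as the host, port, and collection name, are hypothetical; only `chromadb.HttpClient`, `get_or_create_collection`, `add`, and `query` are library calls.

```python
from typing import Any, List


def open_client(host: str = "localhost", port: int = 8000) -> Any:
    """Connect to a Chroma server that is assumed to be running already."""
    import chromadb  # imported lazily so the helpers load without chromadb

    return chromadb.HttpClient(host=host, port=port)


def add_documents(client: Any, name: str, ids: List[str], documents: List[str]) -> None:
    """Writer side: send new documents through the server."""
    collection = client.get_or_create_collection(name)
    collection.add(ids=ids, documents=documents)


def search(client: Any, name: str, text: str, n_results: int = 3) -> Any:
    """Reader side: the search is answered by the same server process."""
    collection = client.get_or_create_collection(name)
    return collection.query(query_texts=[text], n_results=n_results)
```

With this layout both the long-running reader and the update script would call `open_client()` instead of opening the persistent directory themselves, so only the server process holds the in-memory state.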
What happened?
Description
If two processes try to interact with the same persistent database, any updates from one are observable in the documents themselves (get), but not in the embedding search (query endpoint).
This is similar to #3769, but given that it is not specific to Jupyter I wanted to add separate observations.
To me, this does not seem like a far-fetched use case, as I encountered the bug in the following situation:
(1) Process 1: a GUI hosted on a web browser, ideally permanently running
(2) Process 2: scripts for admins to update the database (CLI was much more efficient for these DB updates).
However, after the second process is run, the changes to the database can be viewed in process 1 using client.get(), but query fails to see any added or removed documents.
Below is a simple working example of the situation. When delving into the code to discover why I could see new documents but could not query them, I also found that clearing the _instances cache would resolve the issue, although I could not pinpoint which instance was causing it. My minimal code also shows the effect of this.
Minimal Example
Code output
Versions
chromadb==0.6.3
langchain==0.3.18
langchain-chroma==0.2.1
langchain-community==0.3.17
langchain-core==0.3.34
pysqlite3==0.5.4
pysqlite3-binary==0.5.4
Relevant log output