Caching opened `tantivy.Index`es in the package
#627
Great battle - glad it's solved
index_path = await self.index_filename
if await (index_path / "meta.json").exists():
self._index = Index.open(path=str(index_path)) # type: ignore[assignment]
index_meta_directory = await self.index_filename
I think we can hit a race condition here (because the `exists` check is awaitable). We should probably acquire a lock and then do the check, or do the check synchronously.
Otherwise I think a "gather" call could result in several `False` responses to this `exists` call.
Ok, we talked about this: because we are only instantiating an `Index` and not opening one, this likely isn't an issue. @jamesbraza is going to add a comment for future reference.
Yep thanks for the discussion! I added a comment to the code documenting some of our talking points
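For future readers, here is a minimal sketch of the race raised in the thread above. It does not use the project's code: `files`, `exists`, `build_unlocked`, and `build_locked` are hypothetical stand-ins, and the awaitable `exists` is modeled as determining its answer and then yielding to the event loop before the caller sees it. Under those assumptions, three gathered tasks all observe `False` without a lock, while an `asyncio.Lock` around the check-then-create step lets only the first task build.

```python
import asyncio

# Hypothetical stand-in for a directory: the set of file names it contains
files: set[str] = set()
created = 0  # How many times the "index" was built


async def exists(name: str) -> bool:
    """Awaitable check: the answer is determined, then control is yielded."""
    found = name in files
    await asyncio.sleep(0)  # Other tasks may run before the caller sees `found`
    return found


async def build_unlocked() -> None:
    global created
    if not await exists("meta.json"):  # Several tasks can all see False here
        created += 1
        files.add("meta.json")


async def build_locked(lock: asyncio.Lock) -> None:
    global created
    async with lock:  # Check and create now happen as one unit
        if not await exists("meta.json"):
            created += 1
            files.add("meta.json")


async def run(locked: bool, n: int = 3) -> int:
    """Reset state, run n concurrent builders, and return the build count."""
    global created
    files.clear()
    created = 0
    lock = asyncio.Lock()
    coros = [build_locked(lock) if locked else build_unlocked() for _ in range(n)]
    await asyncio.gather(*coros)
    return created


unlocked_builds = asyncio.run(run(locked=False))  # 3: every task saw False
locked_builds = asyncio.run(run(locked=True))  # 1: only the first task built
print(unlocked_builds, locked_builds)
```

The other option mentioned in the thread, doing the check synchronously, would also close the gap in this sketch, since no other task can run between a synchronous check and the create step.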
…directory_index per trajectory
…all get_directory_index per trajectory" This reverts commit e7f4519.
with added test assertions
…ementation is used
7b39d7a to fe6ed72
Motivation
quickwit-oss/tantivy-py#359 (comment) reveals that one of our servers can only handle three opened indexes at once. The reason why remains unclear, but this PR at least reacts to the issue by caching opened indexes in the `search.py` module's scope. Now we can invoke `await SearchIndex(index_name="normal-index").index` as many times as desired within one Python process.
Implementation Details
Why a global scope? We want to accommodate caching the `Index` across:
- `deepcopy` of a `PaperQAEnvironment`, whose `tools` contains a `PaperSearch` tool instance.
  - We don't store the `Index` in `PaperSearch`, or else we'd be making `Index` copies (and the side effects of this are unknown)
- `Trajectory`, where we (1) build the index and (2) use the index in 0+ paper searches
- `TaskDataset`, where we (1) build the index and (2) run 0+ envs for one trajectory each
- `TaskDataset`, for larger experiments
This can only be accomplished using global scope, whose lifetime matches the entire Python process. This unfortunately requires callers to invoke the newly added `reap_opened_index_cache` at runtime if intermediary cleaning of the cache is desired.
The caching added here can be disabled by setting the newly added environment variable `PQA_INDEX_DONT_CACHE_INDEXES` to `1` or `true` (case insensitive).
Risks
- Calling `reap_opened_index_cache` while also using `PaperSearch`
- Our tests don't call `reap_opened_index_cache` to avoid this, but the tradeoff is that our testing slowly accrues indexes across cases