Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Caching opened tantivy.Indexes in the package #627

Merged
merged 8 commits into from
Oct 23, 2024

Conversation

jamesbraza
Copy link
Collaborator

@jamesbraza jamesbraza commented Oct 22, 2024

Motivation

quickwit-oss/tantivy-py#359 (comment) reveals that one of our servers can only handle three opened indexes at once. The reason why remains unclear, but this PR at least reacts to the issue by caching opened indexes in the search.py module's scope. Now we can invoke await SearchIndex(index_name="normal-index").index as many times as desired within one Python process.

Implementation Details

Why a global scope? We want to accommodate caching the Index across:

  • Across deepcopy of an PaperQAEnvironment, whose tools which contains a PaperSearch tool instance.
    • So we can't cache the Index in PaperSearch, or else we'd be making Index copies (and the side effects of this are unknown)
  • Across a Trajectory, where we (1) build the index and (2) use the index in 0+ paper searches
  • Across a TaskDataset, where we (1) build the index and (2) run 0+ envs for one trajectory each
  • Across many TaskDataset, for larger experiments

This can only be accomplished using global scope, whose lifetime matches the entire Python process. This unfortunately requires callers to invoke the newly added reap_opened_index_cache at runtime if intermediary cleaning of the cache is desired.

The caching added here can be disabled by setting 1 or true (case insensitive) to the newly-added environment variable PQA_INDEX_DONT_CACHE_INDEXES.

Risks

  • Race conditions if invoking reap_opened_index_cache while also using PaperSearch
    • Our test suite intentionally doesn't use reap_opened_index_cache to avoid this, but the tradeoff is our testing slowly accrues indexes across cases
  • Since caching is enabled by default, if the clients aren't aware of this, it can be considered unexpected statefulness

@jamesbraza jamesbraza added the bug Something isn't working label Oct 22, 2024
@jamesbraza jamesbraza self-assigned this Oct 22, 2024
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Oct 22, 2024
Copy link
Collaborator

@whitead whitead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great battle - glad it's solved

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Oct 22, 2024
index_path = await self.index_filename
if await (index_path / "meta.json").exists():
self._index = Index.open(path=str(index_path)) # type: ignore[assignment]
index_meta_directory = await self.index_filename
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can hit a race condition here (because the exists check is awaitable). We should prob. acquire a lock and then do the check or do the check synchronously.

Otherwise I think a "gather" call could result in several False responses to this exists call.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok we talked about this, because we are only instantiating and Index and not opening one, this likely isn't an issue. @jamesbraza is going to add a comment for future reference.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep thanks for the discussion! I added a comment to the code documenting some of our talking points

@jamesbraza jamesbraza force-pushed the fixing-index-reopening branch from 7b39d7a to fe6ed72 Compare October 22, 2024 23:51
@jamesbraza jamesbraza merged commit 96dca14 into main Oct 23, 2024
5 checks passed
@jamesbraza jamesbraza deleted the fixing-index-reopening branch October 23, 2024 00:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request lgtm This PR has been approved by a maintainer size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants