Full text OCR search across all volumes (#944, #945) #953

Merged
merged 7 commits into develop from feature/944-ocr-search
Nov 15, 2023

Conversation

blms
Contributor

@blms blms commented Nov 6, 2023

In this PR

Per #944:

  • Index full OCR text for all documents, with stemming
  • Add searching and highlighting on that field to view and template
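
As a rough illustration of the two bullets above (not taken from this PR's diff), the request bodies below show one way to index a text field with stemming and to search it with highlighting in Elasticsearch. The field name `full_text` and both helper functions are hypothetical; the built-in `english` analyzer is used here because it applies stemming, and the project may configure a different analyzer.

```python
# Minimal sketch, assuming a hypothetical field named "full_text".
# These are plain Elasticsearch request bodies expressed as Python dicts.


def build_mapping():
    """Index mapping: analyze the OCR text with the built-in English analyzer,
    which stems terms (e.g. "harvesting" and "harvested" share a stem)."""
    return {
        "mappings": {
            "properties": {
                "full_text": {"type": "text", "analyzer": "english"},
            }
        }
    }


def build_search_body(query):
    """Search body: match on the stemmed field and request highlighted
    fragments for the same field."""
    return {
        "query": {"match": {"full_text": {"query": query}}},
        "highlight": {"fields": {"full_text": {}}},
    }
```

With a mapping like this, the highlighted fragments returned for each hit can be passed from the view to the template.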

Per #945:

  • In indexing, account for multiple hits within the same full text by nesting canvases within each document and including a page number and ID with each canvas
  • In the frontend, group hits by page number and link each result to its individual page
  • Limit to top 3 pages in a volume sorted by score
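
The grouping and limiting described above could be sketched as follows. This is an assumption-laden illustration rather than the code in this PR: the hit shape (`page_number`, `canvas_id`, `score`, `highlights`) and the function name `top_pages` are hypothetical, and a page's score is taken to be the best score among its hits.

```python
from collections import defaultdict


def top_pages(canvas_hits, limit=3):
    """Group per-canvas hits by page and keep the highest-scoring pages.

    Each hit is assumed (hypothetically) to be a dict with the keys
    "page_number", "canvas_id", "score", and optionally "highlights".
    """
    pages = defaultdict(
        lambda: {"canvas_id": None, "score": 0.0, "highlights": []}
    )
    for hit in canvas_hits:
        page = pages[hit["page_number"]]
        page["canvas_id"] = hit["canvas_id"]
        # A page ranks by its best-scoring hit.
        page["score"] = max(page["score"], hit["score"])
        page["highlights"].extend(hit.get("highlights", []))

    # Sort pages by score, highest first, and keep only the top `limit`.
    ranked = sorted(
        pages.items(), key=lambda item: item[1]["score"], reverse=True
    )
    return [{"page_number": number, **data} for number, data in ranked[:limit]]
```

Each returned entry carries the `canvas_id`, so a template can link the result to that individual page.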

Notes

This will require a reindex with Elastic. The command is:

python manage.py search_index --rebuild

I have a couple of questions about this:

  1. This is going to create quite a large index because it has to index the full text with analysis of every document. It also might eat up significant resources while creating the index. Do you have a way of testing this in a similar environment to prod, or would that just be the dev server? I would be curious about any space or hardware limitations that you run into and if there end up being errors.
    • FWIW, I tried to improve the prefetching, but I was having trouble telling if it was working or not locally.
  2. Would it be helpful to have somewhere to track systems tasks like this that need to be done on deployment, by version number, like a "deploy notes" file? Or is there already a mechanism somewhere for that?

@blms blms requested a review from jayvarner November 6, 2023 19:59
@blms blms force-pushed the feature/944-ocr-search branch from 00f1bed to 8b13757 Compare November 15, 2023 18:48
@jayvarner
Member

Currently, the Fabric-based deploy does reindex everything. I'm moving deployment to AWS' Elastic Container Service and will include reindexing along with the other commands the container will run. But yeah, I can add some deploy notes that summarize the deploy process so we can keep track of everything.

@jayvarner jayvarner merged commit 93bef5e into develop Nov 15, 2023
@blms blms deleted the feature/944-ocr-search branch November 15, 2023 20:09
@blms
Contributor Author

blms commented Nov 15, 2023

@jayvarner I'm especially thinking about a versioned set of notes, something like this that keeps track of deployment process changes per version, but whatever is most helpful for you is fine!
