The Jochre Search server provides OCR search functionality.
This search engine is built on top of Apache Lucene, and stores coordinates for all words in a separate database. When a search is performed, it enables query extension to all words sharing the same lemma as the search term. Using the stored word coordinates, it can also highlight terms matching the search results in the original image. Finally, it includes functionality to allow users to crowd-source OCR corrections, as well as corrections to the book's metadata.
Note that the Jochre 3 OCR search engine is completely independent from the Jochre 3 OCR generation software. All it requires is a PDF (or a set of page images) and corresponding Alto files produced by any OCR tool.
If you use this software in your studies, please cite the following article:
@misc{urieli2025jochre3yiddishocr,
title={Jochre 3 and the Yiddish OCR corpus},
author={Assaf Urieli and Amber Clooney and Michelle Sigiel and Grisha Leyfer},
year={2025},
eprint={2501.08442},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.08442},
}
First, open the console and clone the project using git
(or you can simply download the project) and then change the directory:
git clone [email protected]:joliciel/jochre3-search.git
cd jochre3-search
Navigate to the project directory, and create a file docker-compose/jochre.conf
based on docker-compose.jochre-sample.conf
.
Run the docker-compose script:
docker-compose -p jochre -f docker-compose/jochre3-search.yml up
You can then navigate to the Swagger documentation as follows: http://localhost:4242/docs/
If the application is setup for raw authorization, click on Authorize
, and enter the following string:
{ "username": "Test", "email": "[email protected]", "roles": ["index"] }
Navigate to the project directory, and run the application as follows:
make init-dev-env
sbt
project api
run
You can then navigate to the Swagger documentation as follows: http://localhost:4242/docs/
See above for authorization.
Create your environment variables:
export JOCHRE3_DOCKER_REGISTRY=registry.gitlab.com
export [email protected]
Either create an additional JOCHRE3_DOCKER_PASSWORD variable, or (more secure) add the password to pass:
pass insert jochre/sonatype_deploy
Run the publish script
make publish-image
For the CI script in .gitlab-ci.yml to work, you first need to set up a runner with gitlab-runner
(the docker image below needs to correspond to the one in the yml file):
sudo gitlab-runner register \
--non-interactive \
--url "https://gitlab.com/" \
--registration-token $REGISTRATION_TOKEN \
--executor "docker" \
--description "Docker runner" \
--docker-image "docker:24" \
--docker-privileged
It will automatically deploy a docker image on tags, on the condition that you add two variables to the Gitlab CI settings: JOCHRE3_DOCKER_USERNAME
and JOCHRE3_DOCKER_PASSWORD
.