This is the implementation of the paper "Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval".
First, install PyTorch 1.7.1 (or later) and torchvision, then install CLIP as a Python package.
$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git
Then, install small additional dependencies.
$ pip install -r requirements.txt
The MLT dataset can be downloaded from here.
The PSTR (Phrase-level Scene Text Retrieval) dataset can be downloaded from here.