The tip-of-the-tongue phenomenon, in which a person fails to retrieve a word from memory, poses a difficulty for text-based image search, for example in online shopping. We propose a workaround: querying images from a database by doodling the object of interest.
We do so by constructing a model that represents doodles and real images in the same embedding space, then selecting the real images closest to the drawn doodle. We believe this proof-of-concept can complement Google's existing reverse image search, which does not accept doodles as input.
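To make the retrieval step concrete, here is a minimal sketch of how nearest-neighbour search over a shared embedding space could look. It assumes doodles and real images have already been encoded into vectors; the function name, embedding dimension, and toy data are illustrative placeholders, not the project's actual code.

```python
import torch
import torch.nn.functional as F

def top_k_images(query_doodle: torch.Tensor,
                 database_embeddings: torch.Tensor,
                 k: int = 5) -> torch.Tensor:
    """Return indices of the k database images closest to the doodle.

    query_doodle:        (d,)   embedding of the sketched query
    database_embeddings: (N, d) precomputed embeddings of real images
    """
    # Cosine similarity = dot product of L2-normalised vectors.
    q = F.normalize(query_doodle, dim=0)
    db = F.normalize(database_embeddings, dim=1)
    scores = db @ q                   # (N,) similarity score per image
    return scores.topk(k).indices    # indices of the k best matches

# Toy usage with random embeddings (d=128, database of 1000 images).
if __name__ == "__main__":
    db = torch.randn(1000, 128)
    query = torch.randn(128)
    print(top_k_images(query, db, k=5))
```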
We aim to build an image vector search engine over a database of real-life images: given a doodle sketch, it returns the top real-life images most similar to it. We study how the choice of model architecture (MLP, CNN, ConvNeXt) and learning paradigm (supervised learning, contrastive learning) affects retrieval performance on this problem.
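For the contrastive paradigm, one possible objective is a symmetric InfoNCE (CLIP-style) loss that pulls matched doodle/photo pairs together and pushes mismatched pairs apart. The sketch below is an assumption about how such a loss could be written; the project's actual loss function and hyperparameters may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(doodle_emb: torch.Tensor,
                     photo_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """doodle_emb, photo_emb: (B, d) embeddings of matched doodle/photo pairs."""
    d = F.normalize(doodle_emb, dim=1)
    p = F.normalize(photo_emb, dim=1)
    logits = d @ p.t() / temperature   # (B, B) pairwise similarities
    targets = torch.arange(len(d))     # matched pairs lie on the diagonal
    # Symmetric cross-entropy: doodle-to-photo and photo-to-doodle directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```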
This is a course project for the module CS4243 Computer Vision and Pattern Recognition at the National University of Singapore, instructed by Prof Xavier Bresson. We received the top scores in the cohort, but the repo is no longer under maintenance. Here is the project page.