The aim of this project is to design and train a model that reads images of scanned Arabic documents and generates the text written in them.
This project implements a complete machine learning pipeline that includes (but is not limited to) the following modules:
- preprocessing module
- feature extraction/selection module
- model selection and training module
- performance analysis module
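
To make the first and last stages of the list above more concrete, here is a minimal sketch in Python. It is not taken from `ocr.py`: it assumes that preprocessing includes splitting a binarized page into text lines with a horizontal projection profile, and that performance is measured as character-level accuracy based on edit distance. The function names `segment_lines`, `edit_distance`, and `character_accuracy` are hypothetical and chosen only for illustration.

```python
import numpy as np


def segment_lines(binary):
    """Split a binarized page (1 = ink, 0 = background) into text lines.

    Assumed technique: horizontal projection profile. Returns a list of
    (start_row, end_row) pairs, with end_row exclusive.
    """
    has_ink = binary.sum(axis=1) > 0  # True for every row that contains ink
    # Pad with False so each run of ink rows has a clear start and end.
    padded = np.concatenate(([False], has_ink, [False])).astype(np.int8)
    changes = np.diff(padded)
    starts = np.flatnonzero(changes == 1)   # background -> ink
    ends = np.flatnonzero(changes == -1)    # ink -> background
    return list(zip(starts.tolist(), ends.tolist()))


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(predicted, truth):
    """Assumed metric: 1 - edit_distance / len(truth), clipped at 0."""
    if not truth:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - edit_distance(predicted, truth) / len(truth))
```

In a pipeline of this shape, each row range returned by `segment_lines` would be cropped and passed on to feature extraction, and `character_accuracy` would compare the generated text against the ground truth text of the same image.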
A dataset of images and their ground truth text was obtained from the Watan-2004 Arabic text corpus, compiled by Dr. Mourad Abbas (http://sites.google.com/site/mouradabbas9/corpora).
N.B.: This corpus is for scientific use only. Any use of it to create and release other resources or software requires the authorization of Mourad Abbas.
- python3
- numpy
- opencv
- skimage
- scipy
- matplotlib
$ python3 ./ocr.py