Document classification is an application that assigns scanned documents to one of 21 predefined categories. It is based on a two-step pipeline. First, after the document is read, a trained Linear Regression model assesses it and returns a prediction probability. If that probability is above 90%, the predicted class is returned. Otherwise, the document is passed to a Convolutional Neural Network model for the final prediction.
The line of reasoning is shown in the flow diagram below (alas, in Polish only).
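The two-step decision described above can be sketched in a few lines of Python. This is only an illustration, not the code from the notebooks: `classify_document`, `first_stage_proba` and `cnn_predict` are hypothetical names, the two models are passed in as plain callables, and "above 90%" is read as a strict comparison.

```python
from typing import Callable, Sequence, Tuple

# Threshold taken from the description above ("above 90%").
CONFIDENCE_THRESHOLD = 0.90


def classify_document(
    document: object,
    first_stage_proba: Callable[[object], Sequence[float]],
    cnn_predict: Callable[[object], int],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Tuple[int, str]:
    """Return (class index, name of the stage that made the decision).

    first_stage_proba: hypothetical wrapper around the first model; it
        returns one probability per category for the document.
    cnn_predict: hypothetical wrapper around the CNN; it returns the
        index of the predicted category.
    """
    probabilities = first_stage_proba(document)
    # Pick the category with the highest first-stage probability.
    best_class = max(range(len(probabilities)), key=lambda i: probabilities[i])

    # Confident enough: return the first-stage answer directly.
    if probabilities[best_class] > threshold:
        return best_class, "first_stage"

    # Otherwise fall back to the CNN for the final prediction.
    return cnn_predict(document), "cnn"
```

With a first-stage output of `[0.02, 0.95, 0.03]` the function returns class 1 without calling the CNN; with `[0.40, 0.35, 0.25]` it defers to whatever `cnn_predict` returns.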
The prediction accuracy is 73% (weighted average) for the Linear Regression model and 93.3% for the CNN model. See the screenshots below for details.
The app was designed and written during the HackING 24h hackathon. It was awarded 4th place out of 27 teams.
To download the training data, please refer to the HackING data webpage.
Server: Python, Pytesseract, PyTorch, Scikit-Learn, NLTK.
Download the relevant data and upload it to the data folder.
Run the Jupyter notebooks as necessary for scan recognition, data cleaning, spell check, ML prediction and/or CNN prediction.
Feel free to reproduce our results.
In Polish only.