SlawCzech/docs_classification_ML

Beerware License

Document classification

Document classification is an application that assigns scanned documents to one of 21 predefined categories. It is based on a two-step pipeline. First, after the document is read, a trained Logistic Regression model estimates the probability of its predicted class. If that probability is above 90%, the predicted class is returned. Otherwise, the document is passed to a Convolutional Neural Network (CNN) model for the final prediction.
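The two-step decision described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: `classify_document`, `fast_model` and `fallback_model` are hypothetical names, and both models are represented as plain callables so the sketch runs without any trained weights. Only the 90% threshold comes from the description above.

```python
from typing import Callable, Sequence, Tuple

# Threshold taken from the README: the first model's answer is accepted
# only if its probability is above 90%.
CONFIDENCE_THRESHOLD = 0.90


def classify_document(
    document: str,
    fast_model: Callable[[str], Sequence[float]],
    fallback_model: Callable[[str], int],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Tuple[int, str]:
    """Return (class index, name of the stage that produced it).

    fast_model     -- hypothetical stand-in for the first-stage classifier;
                      returns one probability per category.
    fallback_model -- hypothetical stand-in for the CNN; returns a class index.
    """
    probabilities = fast_model(document)
    best_class = max(range(len(probabilities)), key=lambda i: probabilities[i])

    # Step 1: accept the first-stage prediction if it is confident enough.
    if probabilities[best_class] > threshold:
        return best_class, "first_stage"

    # Step 2: otherwise defer to the second model for the final prediction.
    return fallback_model(document), "cnn"


# Toy stand-ins (assumptions for illustration only, 3 categories instead of 21).
def confident_model(_doc: str) -> Sequence[float]:
    return [0.02, 0.95, 0.03]


def unsure_model(_doc: str) -> Sequence[float]:
    return [0.40, 0.35, 0.25]


def toy_cnn(_doc: str) -> int:
    return 2


print(classify_document("some text", confident_model, toy_cnn))
print(classify_document("some text", unsure_model, toy_cnn))
```

One consequence of this design is that the second model only runs for documents the first stage is unsure about.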

The line of reasoning is shown in the flow diagram below (in Polish only).

The prediction accuracy is 73% (weighted average) for the Logistic Regression model and 93.3% for the CNN model. See the screenshots below for details.
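For readers unfamiliar with the term, a weighted average of per-class scores weights each class by the number of samples it contains (its support), so large classes count for more than small ones. The sketch below shows the arithmetic only. The README does not say which per-class metric is averaged, the function name `weighted_average` is hypothetical, and the numbers are made up for illustration rather than taken from the project's results.

```python
from typing import Sequence


def weighted_average(scores: Sequence[float], supports: Sequence[int]) -> float:
    """Average per-class scores, weighting each class by its sample count."""
    if len(scores) != len(supports):
        raise ValueError("scores and supports must have the same length")
    total = sum(supports)
    if total == 0:
        raise ValueError("supports must not sum to zero")
    return sum(score * n for score, n in zip(scores, supports)) / total


# Made-up example: two classes, the first with three times as many samples.
example_scores = [0.9, 0.5]
example_supports = [30, 10]
print(weighted_average(example_scores, example_supports))
```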

The app was designed and written during the HackING 24h hackathon. It was awarded 4th place out of 27 teams.

To download the training data, please refer to the HackING data webpage.

Tech Stack

Server: Python, Pytesseract, PyTorch, Scikit-Learn, NLTK.

How to run

Download the relevant data and place it in the data folder.

Run the Jupyter notebooks as needed for scan recognition, data cleaning, spell check, ML prediction and/or CNN prediction.

Feel free to reproduce our results.
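As a rough idea of what a data cleaning step on recognized text might involve, here is a small sketch using only the Python standard library. It is an assumption about typical OCR post-processing, not the logic of the project's notebooks, and `clean_ocr_text` is a hypothetical helper name.

```python
import re
from typing import List


def clean_ocr_text(raw: str) -> List[str]:
    """Turn raw OCR output into a list of lowercase word tokens."""
    text = raw.lower()
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"-\s*\n\s*", "", text)
    # Keep runs of letters only (Unicode-aware, so Polish diacritics survive);
    # digits, punctuation and underscores are dropped.
    tokens = re.findall(r"[^\W\d_]+", text)
    # Discard single-character fragments, which are often OCR noise.
    return [token for token in tokens if len(token) > 1]


print(clean_ocr_text("Umowa NAJ-\nMU nr 12/2020, z dnia..."))
```

The resulting tokens could then be passed to a spell checker or a text vectorizer. Which of these the notebooks actually use is not stated here.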

App flow diagram

In Polish only.

[Figure: flow diagram]

Prediction accuracy

Logistic Regression model:

[Screenshot: Logistic Regression]

CNN model:

[Screenshot: CNN]

Authors
