Document classification

Document classification is an application that assigns scanned documents to one of 21 predefined categories. It uses a two-step pipeline. After a document is read, a trained Logistic Regression model first estimates the class probabilities. If the highest probability is above 90%, the predicted class is returned. Otherwise, the document is passed to a Convolutional Neural Network model for the final prediction.
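
The two-step decision described above can be sketched as follows. This is a minimal illustration, not the code from the repository: the function `classify_document`, the callables `predict_proba` and `cnn_predict`, and the constant `CONFIDENCE_THRESHOLD` are hypothetical names, and the sketch assumes the first model returns one probability per class.

```python
from typing import Callable, Sequence, Tuple

# "Above 90%" comes from the description; the constant name is an assumption.
CONFIDENCE_THRESHOLD = 0.90


def classify_document(
    document: object,
    predict_proba: Callable[[object], Sequence[float]],
    cnn_predict: Callable[[object], int],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Tuple[int, str]:
    """Return (class index, name of the stage that produced it).

    predict_proba: hypothetical wrapper around the first-stage model,
                   returning one probability per class.
    cnn_predict:   hypothetical wrapper around the CNN, returning a class index.
    """
    probabilities = predict_proba(document)
    # Index of the most probable class according to the first-stage model.
    best_class = max(range(len(probabilities)), key=lambda i: probabilities[i])

    # Accept the cheap model only when it is strictly above the threshold.
    if probabilities[best_class] > threshold:
        return best_class, "logistic_regression"

    # Otherwise defer to the second-stage model for the final prediction.
    return cnn_predict(document), "cnn"
```

With a scikit-learn classifier, `predict_proba` could wrap the model's own `predict_proba` method for a single sample. Keeping both stages behind plain callables means the threshold logic can be checked without loading either model.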

The decision flow is shown in the flow diagram below (in Polish only).

The prediction accuracy is 73% (weighted average) for the Logistic Regression model and 93.3% for the CNN model. See the screenshots below for details.
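
For readers unfamiliar with the term, a weighted average of a per-class metric weights each class by its number of true samples (its support). The sketch below illustrates this for per-class recall. It is an assumption that this is the kind of average reported here. The function name `weighted_average_recall` and the class labels are made up for the example and are not among the project's 21 categories.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_average_recall(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> float:
    """Average per-class recall, weighting each class by its support."""
    support = Counter(y_true)  # number of true samples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    # recall of class c = correct[c] / support[c]; weight = support[c] / total
    return sum((correct[c] / n) * (n / total) for c, n in support.items())


# Toy labels, purely illustrative.
y_true = ["invoice", "invoice", "invoice", "invoice", "contract"]
y_pred = ["invoice", "invoice", "invoice", "contract", "contract"]
score = weighted_average_recall(y_true, y_pred)  # 0.75 * 0.8 + 1.0 * 0.2
```

Support-weighted recall is numerically equal to plain accuracy, because the support terms cancel. Weighted precision and F1, as printed by scikit-learn's `classification_report`, generally differ from it.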

The app was designed and written during the HackING 24h hackathon, where it was awarded 4th place out of 27 teams.

To download the training data, please refer to the HackING data webpage.

Tech Stack

Server: Python, Pytesseract, PyTorch, Scikit-Learn, NLTK.

How to run

Download the relevant data and place it in the data folder.

Run the Jupyter notebooks as needed for scan recognition, data cleaning, spell checking, ML prediction and/or CNN prediction.
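
As a rough idea of what the data cleaning step might involve, the sketch below normalises raw OCR output before it reaches a text model. It is not taken from the notebooks. The function name `clean_ocr_text` and the specific cleaning rules (joining hyphenated line breaks, dropping digits and punctuation) are assumptions chosen for illustration.

```python
import re
import unicodedata


def clean_ocr_text(raw: str) -> str:
    """Normalise OCR output into lowercase, space-separated words."""
    # Unify Unicode variants (e.g. ligatures) that OCR engines may emit.
    text = unicodedata.normalize("NFKC", raw)
    # Re-join words that were hyphenated across a line break.
    text = re.sub(r"-\s*\n\s*", "", text)
    text = text.lower()
    # Replace punctuation, digits and underscores with spaces;
    # letters with diacritics are kept because \w is Unicode-aware.
    text = re.sub(r"[^\w\s]|\d|_", " ", text)
    # Collapse runs of whitespace, including newlines, into single spaces.
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_ocr_text("Faktura  VAT\nnr 12/2021: doku-\nment")
```

Whether digits should be removed depends on the categories. Document numbers or dates may carry signal for some classes, so this rule is worth revisiting against the actual data.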

App flow diagram

Prediction accuracy

Logistic Regression model:

CNN model:

Authors

SlawCzech/docs_classification_ML
