Skip to content

Text classification for articles using Deep Neural Networks

License

Notifications You must be signed in to change notification settings

dishamisal/Text-Classification-DNN

Repository files navigation

Classify Disease Articles using Deep Neural Networks

N|Solid

Problem Statement

Classify whether an article/text describes an disease. Neural-network variants:

  1. Binary Classification of whether an article accurately describes an disease
  2. Figure out the disease it might be referring to

Source Dataset

Obtained from Wikipedia by scraping through articles

  • Gather articles pertaining to diseases and otherwise using wget
  • Label each article depending on if it pertains to an disease: isDisease
  • HTML Parser to scrape through the essentials from the html document

Logistic Regression (baseline)

LogisticRegression from Scikit-learn is used for:

  • Feature extraction and transformation
  • Logistic Regression Classifier is used to train on the dataset and test on the testing dataset

Binary Classification using Deep Neural Networks

Keras with tensorflow as the backend and scikit-learn for feature extraction:

  • Sentences are extracted from the article and vectorized using CountVectorizer from the scikit-learn library
  • Sequential deep neural network model with 10 layers with relu activation and adam optimizer is used to train on the data
  • Verification is accomplished by splitting the dataset into training and test datasets

Multi-label Disease classification using Deep Neural Networks

Keras with tensorflow as the backend and scikit-learn for feature extraction:

  • Vectorized sentences are tagged along with the disease labels
  • Labels are LabelEncoded and transformed into a OneHotVector to be processed by the DNN model
  • Sequential deep neural network model with 10 layers and output layer with multiple classes is used to train on the data

Runner

Models created for both the parts are trained for sample data-sets and stored as .h5 files

Using these, runner modules could be leveraged to provide user with a script to test out DNN model on-demand

$ cd {Part}/trained_models
$ python runner.py
$ (enter text to be classified)

Conclusions

Model performs satisfactorily well, but with caveats. Future scope includes:

  • Experimentation with Word embeddings and Glove bag-of-words
  • Convolutional Neural Networks and Deep-NLP

About

Text classification for articles using Deep Neural Networks

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages