Classify whether an article or text describes a disease. There are two neural-network variants:
- Binary classification of whether an article describes a disease
- Identification of the disease the article is most likely referring to
The dataset is obtained by scraping Wikipedia articles:
- Gather articles about diseases, and articles about other topics, using wget
- Label each article according to whether it pertains to a disease: isDisease
- Use an HTML parser to extract the essential text from each HTML document
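
The parsing and labelling steps above can be sketched with Python's standard-library html.parser. This is a minimal illustration, not the project's actual parser: the class name ParagraphExtractor, the helper article_record, and the choice to keep only paragraph text are assumptions.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text inside <p> tags (hypothetical helper, not from the project)."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            # Collapse runs of whitespace and drop empty paragraphs.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def article_record(html, is_disease):
    """Return the article text together with its isDisease label (1 or 0)."""
    parser = ParagraphExtractor()
    parser.feed(html)
    return {"text": " ".join(parser.paragraphs), "isDisease": int(is_disease)}


# Toy HTML, invented for illustration only.
sample_html = (
    "<html><body><h1>Influenza</h1>"
    "<p>Influenza is an <b>infectious</b> disease.</p>"
    "<script>var x = 1;</script><p>  </p></body></html>"
)
record = article_record(sample_html, is_disease=True)
print(record)
```

Headings, scripts, and empty paragraphs are dropped here; a real scraper would likely need further rules for tables, references, and navigation boxes.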
Logistic regression with scikit-learn, which is used for:
- Feature extraction and transformation
- Training a LogisticRegression classifier on the training dataset and evaluating it on the test dataset
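
A minimal sketch of such a pipeline is shown below. The specific extraction and transformation steps (CountVectorizer followed by TfidfTransformer), the toy sentences, and the split ratio are assumptions made for illustration; the project may use different ones.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy articles and isDisease labels, invented for illustration only.
texts = [
    "Malaria is a mosquito-borne infectious disease.",
    "Influenza causes fever, cough and muscle pain.",
    "Measles is a highly contagious viral disease.",
    "Diabetes is a chronic metabolic disease.",
    "Paris is the capital of France.",
    "The guitar is a stringed musical instrument.",
    "Football is played by two teams of eleven players.",
    "The Nile is a major river in Africa.",
]
is_disease = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, is_disease, test_size=0.25, stratify=is_disease, random_state=0
)

# Feature extraction (word counts), transformation (TF-IDF), then the classifier.
classifier = make_pipeline(CountVectorizer(), TfidfTransformer(), LogisticRegression())
classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)
print("test accuracy:", classifier.score(X_test, y_test))
```

Wrapping the steps in one pipeline means the vocabulary is fitted on the training split only, so the test split does not leak into the features.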
Binary classification uses Keras with TensorFlow as the backend and scikit-learn for feature extraction:
- Sentences are extracted from each article and vectorized using CountVectorizer from scikit-learn
- A Sequential deep neural network with 10 layers, relu activation, and the adam optimizer is trained on the data
- The model is verified by splitting the dataset into training and test datasets
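
The steps above can be sketched roughly as follows. The layer width (32 units), the sigmoid output with binary cross-entropy loss, the epoch count, and the toy sentences are all assumptions; the source only specifies 10 layers, relu activation, and the adam optimizer.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from tensorflow import keras

# Toy sentences and isDisease labels, invented for illustration only.
sentences = [
    "Malaria is a mosquito-borne infectious disease.",
    "Influenza causes fever, cough and muscle pain.",
    "Measles is a highly contagious viral disease.",
    "Diabetes is a chronic metabolic disease.",
    "Paris is the capital of France.",
    "The guitar is a stringed musical instrument.",
    "Football is played by two teams of eleven players.",
    "The Nile is a major river in Africa.",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Bag-of-words counts as dense float features.
X = CountVectorizer().fit_transform(sentences).toarray().astype("float32")
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=0
)


def build_binary_model(input_dim, hidden_units=32, n_layers=10):
    """Nine relu Dense layers plus one sigmoid output layer (10 layers in total)."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(input_dim,)))
    for _ in range(n_layers - 1):
        model.add(keras.layers.Dense(hidden_units, activation="relu"))
    model.add(keras.layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


model = build_binary_model(X.shape[1])
model.fit(X_train, y_train.astype("float32"), epochs=5, verbose=0)
loss, accuracy = model.evaluate(X_test, y_test.astype("float32"), verbose=0)
print("test accuracy:", accuracy)
```

On a dataset this small the accuracy is not meaningful; the sketch only shows how the pieces fit together.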
Disease identification uses Keras with TensorFlow as the backend and scikit-learn for feature extraction:
- Vectorized sentences are paired with their disease labels
- Labels are encoded with LabelEncoder and converted to one-hot vectors to be processed by the DNN model
- A Sequential deep neural network with 10 layers and a multi-class output layer is trained on the data
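
The label handling described above can be sketched as follows. LabelEncoder comes from scikit-learn as stated; building the one-hot matrix with np.eye is an assumption made to keep the example dependency-light, and the disease names are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy disease labels, one per vectorized sentence (illustration only).
disease_labels = ["influenza", "malaria", "influenza", "measles"]

# Map each disease name to an integer id; classes are sorted alphabetically.
encoder = LabelEncoder()
class_ids = encoder.fit_transform(disease_labels)

# One row per sample with a single 1 in the column of its class.
num_classes = len(encoder.classes_)
one_hot = np.eye(num_classes, dtype="float32")[class_ids]

print(class_ids)  # [0 1 0 2]
print(one_hot)


def decode(probabilities):
    """Turn a vector of per-class scores back into a disease name."""
    return encoder.inverse_transform([int(np.argmax(probabilities))])[0]


print(decode([0.1, 0.7, 0.2]))  # malaria
```

With this encoding, the output layer of the network would need num_classes units, typically with a softmax activation and a categorical cross-entropy loss, so that each output lines up with one column of the one-hot matrix.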
The models for both parts are trained on sample datasets and stored as .h5 files.
Using these files, the runner modules give the user a script for testing the DNN models on demand:
$ cd {Part}/trained_models
$ python runner.py
$ (enter text to be classified)
The models perform satisfactorily, but with caveats. Future scope includes:
- Experimentation with word embeddings and GloVe bag-of-words
- Convolutional neural networks and deep NLP