Text classification using Multinomial NB on 20_newsgroups dataset.

AmritK10/20_Newsgroups_Text_classification

20_Newsgroups_Text_classification

20_Newsgroup Dataset

Abstract

This data set consists of 20,000 messages taken from 20 newsgroups.

Sources

Original Owner and Donor
Tom Mitchell,
School of Computer Science,
Carnegie Mellon University

Data Characteristics

One thousand Usenet articles were taken from each of the following 20 newsgroups:
alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc
Approximately 4% of the articles are crossposted. The articles are typical postings and thus have headers including subject lines, signature files, and quoted portions of other articles.

Data Format

Each newsgroup is stored in a subdirectory, with each article stored as a separate file.
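The directory layout described above maps naturally onto a list of texts and a parallel list of labels. The following is a minimal sketch, not the repository's actual loading code: the function name `load_articles` is hypothetical, and the `latin-1` encoding is an assumption chosen so that articles containing non-UTF-8 bytes can still be read.

```python
from pathlib import Path


def load_articles(root):
    """Return (texts, labels) from a directory with one subdirectory per newsgroup.

    Hypothetical helper: each subdirectory name is used as the label for
    every article file stored inside it.
    """
    texts, labels = [], []
    for group_dir in sorted(Path(root).iterdir()):
        if not group_dir.is_dir():
            continue  # skip stray files next to the newsgroup folders
        for article in sorted(group_dir.iterdir()):
            if article.is_file():
                # latin-1 is an assumption; it decodes any byte sequence
                texts.append(article.read_text(encoding="latin-1"))
                labels.append(group_dir.name)
    return texts, labels
```

Calling `load_articles("20_newsgroups")` (the folder name is an assumption about where the archive was extracted) would then give the inputs for a train/test split.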

Link to the Dataset

http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups

A sample article from the Dataset

[Image: screenshot of a sample article from the dataset]

Implementation

Multinomial Naive Bayes was used to classify a given article into one of the 20 newsgroups. Both the Multinomial Naive Bayes implementation from sklearn and my own implementation were used for the classification, and their results were compared. Both implementations gave exactly the same results, which indicates that the two are equivalent.
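To illustrate what a self-written Multinomial Naive Bayes classifier can look like, here is a minimal sketch in plain Python. It is not the repository's code: the class name `SimpleMultinomialNB` is hypothetical, documents are assumed to be already tokenized into lists of words, and Laplace smoothing with `alpha=1.0` is assumed because that is the default of sklearn's `MultinomialNB`.

```python
import math
from collections import Counter


class SimpleMultinomialNB:
    """Illustrative Multinomial Naive Bayes over tokenized documents."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive (Laplace) smoothing strength

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        # Per-class word counts
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        # Log prior: fraction of training documents in each class
        class_counts = Counter(labels)
        self.log_prior = {
            c: math.log(class_counts[c] / len(labels)) for c in self.classes
        }
        # Total number of word occurrences per class
        self.total = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def _log_likelihood(self, tok, c):
        # Smoothed estimate of log P(tok | c)
        num = self.word_counts[c][tok] + self.alpha
        den = self.total[c] + self.alpha * len(self.vocab)
        return math.log(num / den)

    def predict(self, docs):
        preds = []
        for doc in docs:
            scores = {}
            for c in self.classes:
                # Words never seen in training are ignored
                scores[c] = self.log_prior[c] + sum(
                    self._log_likelihood(tok, c) for tok in doc if tok in self.vocab
                )
            preds.append(max(scores, key=scores.get))
        return preds


# Toy usage with made-up documents
train_docs = [
    ["rocket", "orbit", "launch"],
    ["orbit", "moon"],
    ["engine", "car", "brake"],
    ["car", "tire"],
]
train_labels = ["sci.space", "sci.space", "rec.autos", "rec.autos"]
model = SimpleMultinomialNB().fit(train_docs, train_labels)
toy_predictions = model.predict([["moon", "rocket"], ["tire", "engine"]])
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, which matters for long articles.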

Accuracy

Sklearn's implementation

Testing accuracy : 82.16%
F1 score: 0.82

Self implementation

F1 score: 0.82
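For reference, the two metrics reported above can be computed from the true and predicted labels as sketched below. The README does not state which F1 averaging was used, so the macro average here is an assumption; the function names are hypothetical. In sklearn, the corresponding functions are `accuracy_score` and `f1_score` from `sklearn.metrics`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

If two classifiers produce identical predictions on the same test set, both functions necessarily return identical values for them.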
