Skip to content

Latest commit

 

History

History
48 lines (46 loc) · 1.93 KB

File metadata and controls

48 lines (46 loc) · 1.93 KB

20_Newsgroups_Text_classification

20_Newsgroup Dataset

Abstract

This data set consists of 20000 messages taken from 20 newsgroups.

Sources

Original Owner and Donor
Tom Mitchell,
School of Computer Science,
Carnegie Mellon University

Data Characteristics

One thousand Usenet articles were taken from each of the following 20 newsgroups.
alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc
Approximately 4% of the articles are crossposted. The articles are typical postings and thus have headers including subject lines, signature files, and quoted portions of other articles.

Data Format

Each newsgroup is stored in a subdirectory, with each article stored as a separate file.

Link to the Dataset

http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups

A sample article from the Dataset

Screen Shot 2019-03-31 at 4 38 13 AM

Implementation

Multinomial Naive Bayes was used to classify a given article into one of the 20 newsgroups.The Multinomial Naive Bayes implementation from sklearn aswell as my own self implementation were used for the classification and their results were compared.It was seen that both the implementations gave the exact same results hence both the implementations must be same.

Accuracy

Sklearn's implementation

Testing accuracy : 82.16%
F1 score: 0.82

Self implementation

F1 score: 0.82