20_Newsgroups_Text_classification

20_Newsgroup Dataset

Abstract

This data set consists of 20000 messages taken from 20 newsgroups.

Sources

Original Owner and Donor
Tom Mitchell,
School of Computer Science,
Carnegie Mellon University

Data Characteristics

One thousand Usenet articles were taken from each of the following 20 newsgroups.
alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc
Approximately 4% of the articles are crossposted. The articles are typical postings and thus have headers including subject lines, signature files, and quoted portions of other articles.

Data Format

Each newsgroup is stored in a subdirectory, with each article stored as a separate file.

Link to the Dataset

http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups

A sample article from the Dataset

Implementation

Multinomial Naive Bayes was used to classify a given article into one of the 20 newsgroups.The Multinomial Naive Bayes implementation from sklearn aswell as my own self implementation were used for the classification and their results were compared.It was seen that both the implementations gave the exact same results hence both the implementations must be same.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

20_Newsgroups_Text_classification

20_Newsgroup Dataset

Abstract

Sources

Data Characteristics

Data Format

Link to the Dataset

A sample article from the Dataset

Implementation

Accuracy

Sklearn's implementation

Self implementation

Files

README.md

Latest commit

History

README.md

File metadata and controls

20_Newsgroups_Text_classification

20_Newsgroup Dataset

Abstract

Sources

Data Characteristics

Data Format

Link to the Dataset

A sample article from the Dataset

Implementation

Accuracy

Sklearn's implementation

Self implementation