This repository has been archived by the owner on Jan 22, 2022. It is now read-only.

Demo for News Article Collection and Volume Reduction Pipeline

Brief notebooks that walk through the following steps using a dataset of NYT Front Page articles:

  1. Find efficient keywords with word embeddings (gensim)
  2. Remove duplicate articles with cosine similarity on TF-IDF vectors (scikit-learn)
  3. Remove duplicate articles with entity extraction and Jaccard similarity (spaCy)
  4. Classify relevant articles (scikit-learn)
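
The two de-duplication steps (2 and 3) can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code from the notebooks: the notebooks use scikit-learn and spaCy, whereas here TF-IDF weighting is hand-rolled and the entity sets are typed in by hand instead of being extracted. The function names, sample headlines, entity sets, and thresholds (0.8 and 0.5) are all assumptions chosen for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document.

    Uses whitespace tokenization and a smoothed IDF; a stand-in for a
    library vectorizer, not a reproduction of one.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(items, similarity, threshold):
    """Greedy filter: keep an item only if it is below the threshold
    against every item already kept. Returns the kept indices."""
    kept = []
    for i, item in enumerate(items):
        if all(similarity(item, items[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical sample data, not taken from the NYT dataset.
headlines = [
    "senate passes budget bill after long debate",
    "senate passes budget bill after a long debate",
    "storm closes schools across the northeast",
]
# Hand-written entity sets standing in for the output of entity extraction.
entities = [
    {"Senate", "Washington"},
    {"Senate", "Washington", "Congress"},
    {"Northeast"},
]

kept_tfidf = drop_near_duplicates(tfidf_vectors(headlines), cosine, threshold=0.8)
kept_entities = drop_near_duplicates(entities, jaccard, threshold=0.5)
print(kept_tfidf, kept_entities)
```

In both cases the second headline is treated as a near-duplicate of the first and dropped. The greedy pass compares each article only against those already kept, which keeps the sketch short; a suitable threshold depends on the data and would need tuning.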