Ceegle-search

Finds Google search results that show conspiracy content - NLP, web scraper, Flask app

A web scraper extracts the title and URL of each result from 4 pages of a Google search.
Tweets matching the search key together with #conspiracy are extracted from Twitter. [Positive labels]
Tweets matching the search key alone are extracted from Twitter. [Negative labels]
Together, these labelled tweets are used as the training set.
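The labelling described above could be sketched as follows. This is only an illustration: the function name `build_training_set`, the 1/0 label encoding and the sample tweets are assumptions, not taken from MySearch.py.

```python
def build_training_set(conspiracy_tweets, plain_tweets):
    """Pair each tweet with a label: 1 = positive (#conspiracy), 0 = negative."""
    labelled = [(text, 1) for text in conspiracy_tweets]
    labelled += [(text, 0) for text in plain_tweets]
    return labelled


# Made-up sample tweets, for illustration only.
training_set = build_training_set(
    ["moon landing was staged #conspiracy"],
    ["moon landing anniversary photos"],
)
```

Each entry is a `(text, label)` pair, so the same cleaning steps can later be applied to the text while the label is carried along unchanged.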
The extracted text then has to be cleaned. This is the NLP part of the project:
- Removing symbols, single characters, numbers, etc.
- Removing stop words
- Running a POS (part-of-speech) tagger and keeping only the relevant nouns, verbs and adjectives
- Lemmatizing based on the POS tag (e.g. going, go => go)
- Unwanted words that appear only rarely always remain, so the 30 most frequent words from each of the positive and negative sets are chosen.
- All other words are removed.
- The remaining text is vectorized with TF-IDF (term frequency-inverse document frequency).
The result is converted into a dataframe to assess the frequency of each word.
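A minimal sketch of the cleaning, top-word selection and TF-IDF steps, using only the Python standard library. The POS tagging and lemmatizing steps are left out here. The stop-word list, the function names (`clean`, `top_words`, `tfidf_matrix`) and the smoothed IDF formula are illustrative assumptions, not the code in MySearch.py.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "of", "to", "in", "and", "on", "for", "that", "it"}


def clean(text):
    """Lowercase, drop symbols and numbers, then drop single characters and stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [w for w in text.split() if len(w) > 1 and w not in STOP_WORDS]


def top_words(docs, n=30):
    """Return the n most frequent words across a list of token lists."""
    counts = Counter(w for doc in docs for w in doc)
    return [w for w, _ in counts.most_common(n)]


def tfidf_matrix(docs, vocab):
    """Return one row per document with one TF-IDF weight per vocabulary word."""
    n_docs = len(docs)
    # Document frequency: the number of documents each vocabulary word occurs in.
    df = {w: sum(1 for doc in docs if w in doc) for w in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for an empty document
        row = []
        for w in vocab:
            tf = counts[w] / total
            idf = math.log((1 + n_docs) / (1 + df[w])) + 1  # smoothed IDF (assumption)
            row.append(tf * idf)
        rows.append(row)
    return rows
```

To mirror the steps above, the vocabulary would be the union of `top_words(positive_docs, 30)` and `top_words(negative_docs, 30)`, and each row of the resulting matrix would become one row of the dataframe.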
The classifier is a KNN (k-nearest neighbors) algorithm implemented from scratch. The distance metric is Euclidean distance, with k=7. Compared with the random forest algorithm, this seemed to work well for the prediction.
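A from-scratch KNN classifier with Euclidean distance and k=7 could look roughly like this. It is a sketch under assumptions: the function names `euclidean` and `knn_predict` and the toy vectors are hypothetical, and the actual implementation in MySearch.py may differ.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_vectors, train_labels, query, k=7):
    """Predict the label of query by majority vote among its k nearest neighbours."""
    # Sort training points by distance to the query and keep the k closest.
    distances = sorted(
        ((euclidean(vec, query), label) for vec, label in zip(train_vectors, train_labels)),
        key=lambda pair: pair[0],
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy 2-D vectors standing in for TF-IDF rows (made up for illustration).
train_vectors = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [1.0, 1.0], [0.9, 1.0], [1.0, 0.9], [0.9, 0.9],
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
prediction = knn_predict(train_vectors, train_labels, [0.95, 0.95])
```

With two classes, an odd k such as 7 means the majority vote cannot end in a tie.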
The Google search results go through the same cleaning process that was applied to the training set.
The KNN algorithm is then used to classify them. A GIF recording of the application (ceegleSearch.gif) is included for easy understanding.
