Finding Google search results that show conspiracy content - NLP, web scraper, Flask app
- A web scraper extracts the title and URL of each result from 4 pages of Google search results.
- Tweets matching the search key together with #conspiracy are extracted from Twitter. [Positive labels]
- Tweets matching the search key alone are extracted from Twitter. [Negative labels]
- These two sets of tweets are used as the training set.
- The extracted text then has to be cleaned. This is the NLP preprocessing part of the project.
- Removing symbols, single characters, numbers, etc.
- Removing stop words
- A POS (part-of-speech) tagger is used to extract the relevant nouns, verbs, and adjectives.
- Lemmatizing based on the POS tag (e.g., going, go => go)
- There will always be unwanted words that appear only rarely. So, based on frequency, the top 30 words from each of the positive and negative sets are chosen.
- The other words are removed.
- The remaining text is converted to TF-IDF (term frequency-inverse document frequency) vectors with a vectorizer.
- The vectors are converted into a dataframe to assess the frequency of each word.
- The classifier is KNN (k-nearest neighbors), implemented from scratch. The distance metric is Euclidean distance, with k=7. Compared to the random forest algorithm, this seemed to work well for the prediction.
- For the Google search results, all the cleaning steps applied to the training set are repeated.
- The KNN algorithm is then used to classify the search results. A GIF of the application is included for easy understanding.
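
The cleaning, top-word filtering, and TF-IDF steps listed above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the project's actual code: the function names, the tiny stop-word list, and the sample tweets are all hypothetical, POS tagging and lemmatization are left out, and the TF-IDF weight uses the plain `tf * log(N / df)` form, which may differ from what a library vectorizer computes.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only; a real project
# would likely use a fuller list from an NLP library.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in", "it", "was", "are", "by"}


def clean(text):
    """Lowercase, drop URLs, symbols, numbers, single characters and stop words."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)  # symbols and digits become spaces
    return [t for t in text.split() if len(t) > 1 and t not in STOP_WORDS]


def top_words(docs, n=30):
    """Return the n most frequent tokens across a list of tokenized documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    return {word for word, _ in counts.most_common(n)}


def tfidf(docs, vocab):
    """Return (sorted vocab, rows), one row of tf * log(N / df) weights per document."""
    vocab = sorted(vocab)
    n_docs = len(docs)
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for empty documents
        rows.append([
            (counts[w] / total) * math.log(n_docs / df[w]) if df[w] else 0.0
            for w in vocab
        ])
    return vocab, rows


# Made-up sample tweets, only to show how the pieces fit together.
positive = [clean("The moon landing was FAKED!!! #conspiracy http://t.co/x1")]
negative = [clean("NASA celebrates the moon landing anniversary")]

# Keep the top 30 words of each class, then drop every other word.
keep = top_words(positive, 30) | top_words(negative, 30)
docs = [[t for t in doc if t in keep] for doc in positive + negative]
vocab, rows = tfidf(docs, keep)
```

Each row in `rows` lines up with the sorted `vocab`, so the pair can be loaded into a dataframe with one column per word. A word that occurs in every document gets weight 0 under this formula, since `log(N / df)` is then `log(1)`.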
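
A from-scratch KNN classifier with Euclidean distance and k=7, as described above, might look roughly like the sketch below. The function names and the toy 2-D vectors are assumptions made for illustration. In the project, the inputs would be the TF-IDF rows of the training tweets and of the cleaned search results.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=7):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    distances = sorted(
        ((euclidean(row, query), label) for row, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): label 1 = conspiracy, label 0 = not conspiracy.
train_X = [
    [1.0, 1.0], [0.9, 1.0], [1.0, 0.8], [0.8, 0.9],
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [0.2, 0.1],
]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]

print(knn_predict(train_X, train_y, [0.9, 0.9]))
print(knn_predict(train_X, train_y, [0.1, 0.1]))
```

With two classes, an odd k such as 7 rules out tied votes. This brute-force version compares the query against every training vector, which is fine for a small training set of tweets.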