NLP: Clustering random Wikipedia articles

By: Ahmed Amine MAJDOUBI

Note: It takes about 5 mins to generate 300 Wikipedia articles. Keep that in mind when playing with the parameters.

The code is in the src directory. You will find three Python files inside it:

  • config.py: sets the parameters of the simulation.
  • algorithms.py: contains all the functions that I have created and used in the main file.
  • main.py: the main script to execute (see the run command after this list).
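
Assuming this layout, the pipeline would typically be launched from the repository root with:

    python src/main.py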

The code generates a results.csv file containing the articles, detected language and cluster label.
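
As a rough sketch of how the output could be consumed (the column names 'article', 'language' and 'cluster' are assumptions, not taken from the repository):

    import csv

    # Print the detected language and cluster label of each article (hypothetical column names)
    with open('results.csv', newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            print(row['language'], row['cluster'], row['article'][:60])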

The notebook contains the necessary documentation and an explanation of my approach for this NLP exercise. The goal of my submission was not to produce the best accuracy or to implement complex NLP algorithms, but rather to show my ability to write clear, robust and well-documented code for data science tasks. At the end of the notebook, I discuss what could be done to further improve the segmentation task.

The code functions are divided into four main types (a sketch of the full pipeline follows this list):

  • Document Extraction: generate random articles in different languages from Wikipedia.
  • Language Detection: detect the language of each extracted document and group the documents by language.
  • Text Processing: process each document by tokenizing, stemming and removing stop words.
  • Document Clustering: cluster each language group of documents using TF-IDF and K-Means.
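
A minimal sketch of these four steps is shown below. The function names are illustrative and do not necessarily match those in algorithms.py, and scikit-learn is assumed here for TF-IDF and K-Means, although it is not listed among the repository's dependencies.

    # Hedged sketch of the four pipeline steps; see the lead-in above for assumptions.
    from collections import defaultdict

    import wikipedia
    from langdetect import detect
    from nltk.corpus import stopwords
    from nltk.stem.snowball import SnowballStemmer
    from nltk.tokenize import word_tokenize
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer


    def extract_documents(lang_code, n_articles):
        """Document Extraction: fetch the text of random articles in one language."""
        wikipedia.set_lang(lang_code)
        titles = wikipedia.random(pages=n_articles)
        if isinstance(titles, str):  # wikipedia.random returns a str when pages == 1
            titles = [titles]
        docs = []
        for title in titles:
            try:
                docs.append(wikipedia.page(title).content)
            except wikipedia.exceptions.WikipediaException:
                continue  # skip disambiguation pages and missing pages
        return docs


    def group_by_language(docs):
        """Language Detection: group documents by their detected language code."""
        groups = defaultdict(list)
        for doc in docs:
            groups[detect(doc)].append(doc)
        return groups


    def preprocess(doc, language):
        """Text Processing: tokenize, remove stop words and stem (language is a full name, e.g. 'english')."""
        stemmer = SnowballStemmer(language)
        stops = set(stopwords.words(language))
        tokens = word_tokenize(doc.lower(), language=language)
        return " ".join(stemmer.stem(t) for t in tokens if t.isalpha() and t not in stops)


    def cluster_documents(processed_docs, n_clusters):
        """Document Clustering: TF-IDF vectors clustered with K-Means; returns one label per document."""
        matrix = TfidfVectorizer().fit_transform(processed_docs)
        return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(matrix)

In main.py these steps would typically be chained for each language group, with the resulting labels written to results.csv.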

For this submission, I used the following third-party libraries:

  • wikipedia: Wikipedia API for Python to access and parse data from Wikipedia.
  • langdetect: Language detection library ported from Google's language-detection. Supports 55 languages.
  • nltk: Provides the necessary tools for symbolic and statistical natural language processing.

You can install these libraries using pip by uncommenting and running the following commands:

  • pip install wikipedia
  • pip install langdetect
  • pip install nltk
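
If the text processing step relies on NLTK's tokenizer and stop-word lists, the corresponding corpora also need to be downloaded once (this is standard NLTK setup, not something specific to this repository):

    import nltk

    # One-time download of the resources used for tokenization and stop-word removal
    nltk.download('punkt')
    nltk.download('stopwords')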

The parameters in the config file are (a sketch of config.py follows this list):

  • LANGUAGES: list of the languages of the randomly generated articles.
  • LANG_DICT: dictionary mapping each language code to the full language name (necessary for the NLTK library).
  • ARTICLES_PER_LANG: number of documents to generate per language.
  • NCLUSTERS: number of clusters to generate for each language group.
  • NTERMS: number of keywords to generate for each cluster.
  • PLOT2D: whether to plot a 2D graph of the clusters for each language.
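
A minimal sketch of what config.py might contain, using the defaults described below; the exact values and language entries in the repository may differ.

    # Sketch of config.py based on the parameter list and defaults in this README (assumed, not copied).

    # Language codes of the randomly generated articles
    LANGUAGES = ['en', 'fr', 'es']

    # Mapping from language code to the full language name expected by NLTK
    LANG_DICT = {'en': 'english', 'fr': 'french', 'es': 'spanish'}

    # Number of documents to generate per language
    ARTICLES_PER_LANG = 100

    # Number of clusters to generate for each language group
    NCLUSTERS = 3

    # Number of keywords to generate for each cluster
    NTERMS = 7

    # Whether to plot a 2D graph of the clusters per language
    PLOT2D = True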

By default, the file generates 100 random Wikipedia articles per language. The languages are English, French and Spanish. The number of clusters per language group is 3, the number of keywords per cluster is 7, and the 2D plots are generated at the end.
