NLP: Clustering random Wikipedia articles

By: Ahmed Amine MAJDOUBI

Note: It takes about 5 mins to generate 300 Wikipedia articles. Keep that in mind when playing with the parameters.

The code is in the src directory. You will find three Python files inside it:

  • config.py: sets the parameters of the simulation.
  • algorithms.py: contains all the functions that I have created and used in the main file.
  • main.py: the main script to execute (see the run command after this list).
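
Assuming this layout, the pipeline would typically be launched from the repository root with:

    python src/main.py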

The code generates a results.csv file containing the articles, detected language and cluster label.
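
As a rough sketch of how the output could be consumed (the column names 'article', 'language' and 'cluster' are assumptions, not taken from the repository):

    import csv

    # Print the detected language and cluster label of each article (hypothetical column names)
    with open('results.csv', newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            print(row['language'], row['cluster'], row['article'][:60])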

The notebook contains the necessary documentation and an explanation of my approach for this NLP exercise. The goal of my submission was not to produce the best accuracy or to implement complex NLP algorithms, but rather to show my ability to write clear, robust and well-documented code for data science tasks. At the end of the notebook, I discuss what could be done to further improve the segmentation task.

The code functions are divided into four main types (a sketch of the full pipeline follows this list):

  • Document Extraction: generate random articles in different languages from Wikipedia.
  • Language Detection: detect the language of each extracted document and group the documents by language.
  • Text Processing: process each document by tokenizing, stemming and removing stop words.
  • Document Clustering: cluster each language group of documents using TF-IDF and K-Means.
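
A minimal sketch of these four steps is shown below. The function names are illustrative and do not necessarily match those in algorithms.py, and scikit-learn is assumed here for TF-IDF and K-Means, although it is not listed among the repository's dependencies.

    # Hedged sketch of the four pipeline steps; see the lead-in above for assumptions.
    from collections import defaultdict

    import wikipedia
    from langdetect import detect
    from nltk.corpus import stopwords
    from nltk.stem.snowball import SnowballStemmer
    from nltk.tokenize import word_tokenize
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer


    def extract_documents(lang_code, n_articles):
        """Document Extraction: fetch the text of random articles in one language."""
        wikipedia.set_lang(lang_code)
        titles = wikipedia.random(pages=n_articles)
        if isinstance(titles, str):  # wikipedia.random returns a str when pages == 1
            titles = [titles]
        docs = []
        for title in titles:
            try:
                docs.append(wikipedia.page(title).content)
            except wikipedia.exceptions.WikipediaException:
                continue  # skip disambiguation pages and missing pages
        return docs


    def group_by_language(docs):
        """Language Detection: group documents by their detected language code."""
        groups = defaultdict(list)
        for doc in docs:
            groups[detect(doc)].append(doc)
        return groups


    def preprocess(doc, language):
        """Text Processing: tokenize, remove stop words and stem (language is a full name, e.g. 'english')."""
        stemmer = SnowballStemmer(language)
        stops = set(stopwords.words(language))
        tokens = word_tokenize(doc.lower(), language=language)
        return " ".join(stemmer.stem(t) for t in tokens if t.isalpha() and t not in stops)


    def cluster_documents(processed_docs, n_clusters):
        """Document Clustering: TF-IDF vectors clustered with K-Means; returns one label per document."""
        matrix = TfidfVectorizer().fit_transform(processed_docs)
        return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(matrix)

In main.py these steps would typically be chained for each language group, with the resulting labels written to results.csv.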

For this submission, I used the following third-party libraries:

  • wikipedia: Wikipedia API for Python to access and parse data from Wikipedia.
  • langdetect: Language detection library ported from Google's language-detection. Supports 55 languages.
  • nltk: Provides the necessary tools for symbolic and statistical natural language processing.

You can install these libraries using pip by uncommenting and running the following commands:

  • pip install wikipedia
  • pip install langdetect
  • pip install nltk
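
If the text processing step relies on NLTK's tokenizer and stop-word lists, the corresponding corpora also need to be downloaded once (this is standard NLTK setup, not something specific to this repository):

    import nltk

    # One-time download of the resources used for tokenization and stop-word removal
    nltk.download('punkt')
    nltk.download('stopwords')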

The parameters in the config file are (a sketch of config.py follows this list):

  • LANGUAGES: list of the languages of the randomly generated articles.
  • LANG_DICT: dictionary mapping each language code to the full language name (necessary for the NLTK library).
  • ARTICLES_PER_LANG: number of documents to generate per language.
  • NCLUSTERS: number of clusters to generate for each language group.
  • NTERMS: number of keywords to generate for each cluster.
  • PLOT2D: whether to plot a 2D graph of the clusters for each language.
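
A minimal sketch of what config.py might contain, using the defaults described below; the exact values and language entries in the repository may differ.

    # Sketch of config.py based on the parameter list and defaults in this README (assumed, not copied).

    # Language codes of the randomly generated articles
    LANGUAGES = ['en', 'fr', 'es']

    # Mapping from language code to the full language name expected by NLTK
    LANG_DICT = {'en': 'english', 'fr': 'french', 'es': 'spanish'}

    # Number of documents to generate per language
    ARTICLES_PER_LANG = 100

    # Number of clusters to generate for each language group
    NCLUSTERS = 3

    # Number of keywords to generate for each cluster
    NTERMS = 7

    # Whether to plot a 2D graph of the clusters per language
    PLOT2D = True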

By default, the file generates 100 random Wikipedia articles per language. The languages are English, French and Spanish. The number of clusters per language group is 3, the number of keywords per cluster is 7, and the 2D plots are generated at the end.
