This codebase extracts people's skills from a CSV file and turns them into tags. The tags are used to build a skill taxonomy, and tags from that taxonomy are then assigned to the employees listed in the CSV file.
- cluster_skills.py - Contains the parallelized clustering algorithm used to make the skill taxonomy broader. A low `n_clusters` gives a more general taxonomy; a high value gives a more specific one.
- utils.py - Main file containing the logic that generates skills_taxonomy.txt and individual_skills.csv.
- app.py
- individual_skills.csv - A CSV with 2 columns, `Name` and `Skills`, with one row for every employee.
- skills_taxonomy.txt - List of skills generated from the initial dataset after clustering.
- postprocessing.py - Use this when you need a more refined output, i.e., a broader or more specific skill taxonomy. It generates individual_skills_refined.csv and skills_taxonomy_refined.txt.
- individual_skills_refined.csv - Produced by postprocessing.py. It has the same format as individual_skills.csv.
- skills_taxonomy_refined.txt - Produced by postprocessing.py. It has the same format as skills_taxonomy.txt.
- logs.txt - Contains the logs of an example run of utils.py.
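
To illustrate how `n_clusters` trades generality for specificity, here is a minimal, standard-library-only sketch. It is not the algorithm in cluster_skills.py: it assumes a toy average-linkage agglomerative clustering over character-level string similarity (`difflib`), it is not parallelized, and the `cluster_skills` and `similarity` functions and the sample skill list are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a real distance measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_skills(skills: list[str], n_clusters: int) -> list[list[str]]:
    """Merge the two most similar groups until only n_clusters groups remain."""
    clusters = [[s] for s in skills]  # start with one group per skill
    while len(clusters) > max(n_clusters, 1):
        best_pair, best_score = (0, 1), -1.0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average-linkage: mean similarity over all cross-group pairs.
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                score = sum(similarity(a, b) for a, b in pairs) / len(pairs)
                if score > best_score:
                    best_pair, best_score = (i, j), score
        i, j = best_pair
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


# Hypothetical sample data, not taken from the repository.
skills = ["python", "python scripting", "java", "javascript", "sql", "mysql"]
broad = cluster_skills(skills, n_clusters=2)      # few, general groups
specific = cluster_skills(skills, n_clusters=5)   # many, narrow groups
print(len(broad), len(specific))
```

With a low `n_clusters` every group absorbs more skills, so each resulting tag has to cover more ground; with a high value most skills keep a group of their own.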
- Create a CSV that has a column called "Skill Sets" containing each employee's skills described in natural language.
- Create a `.env` file and define the `OPENAI_API_KEY` environment variable. (The number of API calls will be equal to the number of rows in your CSV.)
- Create a virtual environment and install the dependencies: `python3 -m venv venv`, then `pip install -r requirements.txt`
- Run `python3 utils.py > logs.txt 2>&1`
- To get a more refined output, run `python3 postprocessing.py --n_clusters 100`
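
Before a run, it can help to check the input file and estimate the number of API calls. The sketch below is only an illustration using the standard library: the `count_api_calls` and `api_key_is_set` helpers, the `employees.csv` file name, the `Name` column in the input, and the sample rows are all assumptions, and it reads `OPENAI_API_KEY` from the process environment rather than parsing `.env` itself.

```python
import csv
import os
import tempfile

REQUIRED_COLUMN = "Skill Sets"  # column name required by the setup steps


def count_api_calls(csv_path: str) -> int:
    """Return the number of data rows, which equals the expected API call count."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if REQUIRED_COLUMN not in (reader.fieldnames or []):
            raise ValueError(f"missing required column: {REQUIRED_COLUMN!r}")
        return sum(1 for _ in reader)


def api_key_is_set() -> bool:
    """Check the environment only; loading .env is left to the application."""
    return bool(os.environ.get("OPENAI_API_KEY"))


# Hypothetical sample rows; the "Name" input column is an assumption.
rows = [
    {"Name": "Alice", "Skill Sets": "Builds REST APIs in Python and tunes SQL queries"},
    {"Name": "Bob", "Skill Sets": "Designs dashboards and runs A/B tests"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "employees.csv")  # hypothetical file name
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Name", "Skill Sets"])
        writer.writeheader()
        writer.writerows(rows)
    n_calls = count_api_calls(path)

print(f"rows / expected API calls: {n_calls}")
print(f"OPENAI_API_KEY set: {api_key_is_set()}")
```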
The application is deployed at this link: Skill Extractor UI