Topic modeling for Arabic Tweets

This code is adapted from the study "BERT for Arabic Topic Modeling: An Experimental Study on BERTopic Technique" (code).

Please refer to this blog post for more details about this repository.

huggingface

Interactive demo

Create a conda environment

conda create -n AraTop python=3.7 anaconda
conda activate AraTop

Install requirements

pip install bertopic 
pip install flair  

Dataset

The dataset is based on the ArabGend dataset (2022) [1]: 108,053 tweets.

Get the tweet IDs from the data file or from [1], then retrieve the tweets using the Twitter API:

pip install twarc
twarc2 hydrate ids.txt tweets.json
twarc2 hydrate twitt_ID.txt tweets.json

Convert the JSON file to CSV with twarc-csv:

pip3 install --upgrade twarc-csv
twarc2 csv --no-json-encode-all tweets.json tweets_CSV.csv
csvcut --columns id,text tweets_CSV.csv
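
Note that `csvcut` is part of csvkit, which is a separate install from twarc. Where csvkit is not available, the same column selection can be sketched with Python's standard library. The function name `select_columns` and the output filename in the usage note are assumptions made for illustration; they are not part of this repository.

```python
import csv


def select_columns(src_path, dst_path, columns=("id", "text")):
    """Copy only the given columns from src_path to dst_path; return the row count."""
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        # extrasaction="ignore" silently drops every column not listed in `columns`.
        writer = csv.DictWriter(dst, fieldnames=list(columns), extrasaction="ignore")
        writer.writeheader()
        count = 0
        for row in reader:
            writer.writerow(row)
            count += 1
    return count
```

For example, `select_columns("tweets_CSV.csv", "tweets_id_text.csv")` would write a two-column file, where `tweets_id_text.csv` is a hypothetical output name.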

To clean and pre-process the dataset:

python arabic_cleaner.py
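
The contents of `arabic_cleaner.py` are not shown here, so the following is only a minimal sketch of what cleaning Arabic tweets commonly involves: removing URLs and mentions, stripping diacritics and tatweel, normalizing alef variants, and dropping non-Arabic characters. The function name `clean_tweet` and the exact set of rules are assumptions, not the script's actual behavior.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Arabic diacritics (U+064B to U+0652) and tatweel (U+0640).
DIACRITICS_RE = re.compile("[\u064B-\u0652\u0640]")
# Alef with madda, hamza above, and hamza below.
ALEF_VARIANTS_RE = re.compile("[\u0622\u0623\u0625]")
# Anything that is neither an Arabic letter (U+0621 to U+064A) nor whitespace.
NON_ARABIC_RE = re.compile("[^\u0621-\u064A\\s]")


def clean_tweet(text):
    """Return a normalized, Arabic-only version of a tweet (illustrative rules)."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = DIACRITICS_RE.sub("", text)
    text = ALEF_VARIANTS_RE.sub("\u0627", text)  # map alef variants to bare alef
    text = NON_ARABIC_RE.sub(" ", text)  # also drops digits, '#', '_' and punctuation
    return " ".join(text.split())  # collapse runs of whitespace
```

Under these rules a hashtag such as `#word_word` keeps its words but loses the `#` and `_`. Whether digits or emoji should be kept depends on the downstream model, so treat the character ranges as a starting point.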

[1] ArabGend: Gender Analysis and Inference on Arabic Twitter

Training

For topic modeling via UMAP:

run_umap.sh

For topic modeling via HDBSCAN:

run_hdbscan.sh

For the joint model (UMAP + HDBSCAN):

run joint.sh 
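
The training scripts themselves are not reproduced here. As a rough sketch of how a joint UMAP + HDBSCAN pipeline can be wired together in BERTopic, the example below passes custom `umap_model` and `hdbscan_model` objects to the `BERTopic` constructor. The helper names, the hyperparameter values, and the choice of `language="multilingual"` are assumptions for illustration and may differ from what the scripts actually use.

```python
def joint_model_params(n_neighbors=15, n_components=5, min_cluster_size=15):
    """Return illustrative hyperparameters for the UMAP and HDBSCAN stages."""
    umap_params = {
        "n_neighbors": n_neighbors,
        "n_components": n_components,
        "min_dist": 0.0,
        "metric": "cosine",
    }
    hdbscan_params = {
        "min_cluster_size": min_cluster_size,
        "metric": "euclidean",
        "cluster_selection_method": "eom",
        "prediction_data": True,  # needed to assign topics to unseen documents
    }
    return umap_params, hdbscan_params


def build_topic_model(umap_params, hdbscan_params):
    """Build a BERTopic model with explicit dimensionality reduction and clustering."""
    # Imported lazily so the parameter helper works without the heavy dependencies.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    return BERTopic(
        language="multilingual",  # assumption: a multilingual sentence embedding
        umap_model=UMAP(**umap_params),
        hdbscan_model=HDBSCAN(**hdbscan_params),
    )
```

With `docs` as a list of cleaned tweets, `topics, probs = build_topic_model(*joint_model_params()).fit_transform(docs)` would fit the model and return one topic ID per tweet.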

Inference

Load the trained model:

python infer.py
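
What `infer.py` does internally is not shown here. A minimal sketch of inference with a saved BERTopic model, assuming it was stored with BERTopic's own `save` method, could look like the following. The function names and the `model_path` argument are hypothetical.

```python
from collections import Counter


def predict_topics(model_path, docs):
    """Load a saved BERTopic model and pair each document with its topic ID."""
    from bertopic import BERTopic  # imported lazily; requires `pip install bertopic`

    topic_model = BERTopic.load(model_path)
    topics, _probs = topic_model.transform(docs)
    return list(zip(docs, topics))


def topic_counts(assignments):
    """Count documents per topic ID; BERTopic uses -1 for outliers."""
    return dict(Counter(topic for _doc, topic in assignments))
```

Calling `topic_counts(predict_topics(model_path, docs))` gives a quick view of how new tweets are spread across topics and how many fall into the outlier topic.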

Acknowledgment

The implementation of the project relies on resources from BERTopic, Huggingface Transformers, and SBERT. We thank the original authors for their well-organized codebase.