EML4U Drift Detector Comparison

This is the code repository of the article Drift Detection in Text Data with Document Embeddings (2021) by Feldhans, Wilke, Heindorf, Shaker, Hammer, Ngonga Ngomo, and Hüllermeier. (DOI, PDF preprint, BibTeX)

Installation notes

For the specific software versions, see requirements.txt.

The code uses embedding/BertHuggingface from https://github.com/UBI-AGML-NLP/Embeddings at version 1.0. To install it from a local clone:

cd /path/to/Embeddings/
git checkout 1.0
pip3 install -U .

Alibi Detect

# Errors occur with tensorflow 2.4.1 and 2.3.0
pip3 uninstall tensorflow
pip3 install -U tensorflow==2.4.0

pip3 install alibi-detect==0.7.0
pip3 install nlp
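
To confirm that the active environment matches the pins above, a small check like the following can help. It is a minimal sketch using only the standard library; the EXPECTED dictionary simply repeats the two versions named in these notes, and the helper name check_versions is an illustration, not part of this repository.

```python
from importlib import metadata

# Pins copied from the installation notes above.
EXPECTED = {"tensorflow": "2.4.0", "alibi-detect": "0.7.0"}


def check_versions(expected):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in check_versions(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {installed}")
```

An empty result means both packages are installed at the pinned versions.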

Dynamic Adapting Window Independence Drift Detection (DAWIDD)

Clone this git repository and add it to your PYTHONPATH environment variable or to sys.path before importing detectors.
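
The sys.path variant could look like the sketch below. The helper add_repo_to_path and the clone location in the usage comment are hypothetical; substitute the directory where you actually cloned the repository.

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_dir):
    """Prepend repo_dir to sys.path (only once) and return the resolved path."""
    resolved = str(Path(repo_dir).expanduser().resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Hypothetical clone location; adjust to your setup before importing detectors.
# add_repo_to_path("~/src/cloned-repository")
```

Setting PYTHONPATH in the shell has the same effect for every Python process started from that shell, whereas the sys.path approach only affects the current interpreter.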

Activate an environment (example)

conda info --envs
conda activate EML4U

How to run these experiments

Data access

Download the Amazon and Twitter base datasets

Amazon: https://snap.stanford.edu/data/web-Movies.html
- location: data/movies/movies.txt.gz and data/movies/movies.txt
Twitter: https://www.kaggle.com/manchunhui/us-election-2020-tweets
- location: data/twitter/hashtag_donaldtrump.csv and data/twitter/hashtag_joebiden.csv
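
Before starting the preparation scripts, it can be useful to verify that the downloads sit at the locations listed above. The following is a minimal sketch; the function name missing_data_files is an illustration and the check only tests for file existence, not for content.

```python
from pathlib import Path

# Locations taken from the list above, relative to the repository root.
EXPECTED_FILES = [
    "data/movies/movies.txt.gz",
    "data/movies/movies.txt",
    "data/twitter/hashtag_donaldtrump.csv",
    "data/twitter/hashtag_joebiden.csv",
]


def missing_data_files(root=".", expected=EXPECTED_FILES):
    """Return the expected files that do not exist below root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    for rel in missing_data_files():
        print(f"missing: {rel}")
```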

Data preparation

Amazon

Run amazon_movie_sorter.py
- Sorts the dataset by helpfulness.score.time and saves the result along with the text.
- In: data/movies/movies.txt.gz
- Out: data/movies/embeddings/amazon_raw.pickle
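
To illustrate what this step operates on, here is a rough sketch of reading the SNAP file and ordering reviews. It is not the code of amazon_movie_sorter.py. It assumes the file consists of "key: value" lines such as review/score, review/time and review/text, with blank lines between reviews, and that latin-1 is a workable encoding. Sorting by review/time alone is a simplification chosen for the example.

```python
import gzip


def parse_reviews(lines):
    """Yield one dict per review from 'key: value' lines separated by blank lines."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line:
            if record:
                yield record
                record = {}
            continue
        key, sep, value = line.partition(": ")
        if sep:  # lines without a 'key: value' shape are skipped
            record[key] = value
    if record:
        yield record


def load_sorted_by_time(path):
    """Read a gzipped review file and return its reviews sorted by review/time."""
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        reviews = list(parse_reviews(handle))
    return sorted(reviews, key=lambda review: int(review["review/time"]))
```

Note that load_sorted_by_time keeps every review in memory, which may be too much for the full dataset on a small machine.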

Twitter

filter_tweets.py
- Converts the tweets into a more easily readable format and filters out malformed data points
- In: hashtag_joebiden.csv
- In2: hashtag_donaldtrump.csv
- Out: election_dataset_raw.pickle
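
The kind of filtering described here can be sketched as follows. This is not the code of filter_tweets.py. The column name tweet and the definition of "malformed" (rows with too few or too many fields, or with empty text) are assumptions made for the example.

```python
import csv


def read_valid_tweets(path, text_column="tweet"):
    """Return the CSV rows that have the right number of fields and non-empty text."""
    valid = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # DictReader stores surplus fields under the key None
            # and fills missing fields with the value None.
            if None in row or any(value is None for value in row.values()):
                continue
            if not row.get(text_column, "").strip():
                continue
            valid.append(row)
    return valid
```

Applying the function to both CSV files and concatenating the results would give one combined list of well-formed tweets.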

Model preparation

BERT

finetune_amazon_bert_768.py
- fine-tunes a BERT model for 10 epochs and saves the model after each epoch
- In: data/movies/embeddings/amazon_raw.pickle
- Out: data/movies/movie_{1-9}e/

BoW

word2vec/doc2vec.py
- creates a BoW model for the Amazon data
word2vec/doc2vec_twitter_election.py
- creates a BoW model for the Twitter data

Embedding generation

generate_all_the_datasets.py
- generates all embeddings for all models and datasets in a predetermined order via separate scripts
- see inside the scripts for more detail
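
The pattern of running separate scripts in a fixed order can be sketched as below. This is an illustration, not the content of generate_all_the_datasets.py. The function name run_in_order and the script names in the usage comment are hypothetical.

```python
import subprocess
import sys


def run_in_order(scripts, python=sys.executable):
    """Run each script with the given interpreter, stopping at the first failure."""
    completed = []
    for script in scripts:
        print(f"running {script}")
        # check=True raises CalledProcessError if the script exits with a non-zero code.
        subprocess.run([python, script], check=True)
        completed.append(script)
    return completed


# Hypothetical usage; the real script names and their order are defined in the repository.
# run_in_order(["first_generator.py", "second_generator.py"])
```

Stopping at the first failure avoids generating later embeddings from incomplete earlier outputs.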

Experiments

Amazon

Run (in any order)
- amazon_different_classes.ipynb
- amazon_same_dist.ipynb
- amazon_drift_induction.ipynb

Twitter

Run (in any order)
- twitter_different_classes.ipynb
- twitter_same_dist.ipynb
- twitter_drift_induction.ipynb
- twitter_different_dist.ipynb

Figures

Run evaluation/plots_injection.ipynb and evaluation
- this will create the basic figures used in the paper
Run evaluation/figure-diff_dist-results.sh and evaluation/figure-injection-results.sh
- this will merge the figures to what you see in the paper
Run evaluation/tweet_count_gen.ipynb
- this will create Figure 1 of the paper

Acknowledgments

This work has been supported by the German Federal Ministry of Education and Research (BMBF) within the project EML4U under grant no. 01IS19080 A and B.
