This repository provides a software pipeline for explaining drift between two sets of documents using embeddings.
First experiments indicate that BERT document embeddings outperform Doc2Vec document embeddings.
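The core idea above can be illustrated with a minimal sketch: two document sets represented as embedding matrices, and a simple drift signal between them. The array shapes and the centroid-distance measure below are illustrative assumptions, not the repository's actual API.

```python
import numpy as np

# Toy stand-ins for real document embeddings (e.g. from BERT).
rng = np.random.default_rng(0)
embeddings_a = rng.normal(0.0, 1.0, size=(100, 8))   # document set A
embeddings_b = rng.normal(0.5, 1.0, size=(100, 8))   # document set B, shifted mean

# One simple drift signal: the distance between the corpus centroids.
centroid_a = embeddings_a.mean(axis=0)
centroid_b = embeddings_b.mean(axis=0)
drift = float(np.linalg.norm(centroid_a - centroid_b))
print(drift > 0.0)
```

A real pipeline would replace the random arrays with embeddings produced by the classes in `transformation` and go beyond a single scalar to explain *what* drifted.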
- How to configure file storage and the default directory to read data
- Amazon movie reviews
- Data overview
- How to read with Amazon Pickle_Reader and access texts, embeddings, and metadata
- How to read with Amazon Pickle_Splitter and get equally split items
- Data is currently stored on Google Drive
- How to store interim results
- How to reduce dimensions
- How to create word clouds
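For the dimension-reduction step listed above, a hedged sketch using scikit-learn's PCA is shown below. The input shape (BERT-sized 768-dimensional vectors) and the choice of PCA are assumptions for illustration; the repository's own `transformation` classes may use a different method.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder embeddings; in practice these would be read via the access classes.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(200, 768))  # 200 documents, 768-dim vectors

# Reduce to 2 dimensions, e.g. for plotting or word-cloud positioning.
pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)
print(reduced.shape)  # (200, 2)
```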
- Goal: Reusable, complete and documented code (good for developers, reviewers, everyone)
- If you add new classes, please provide minimal code examples, put them into the `doc` directory, and add a link above.
- Directories:
  - `doc`: Documentation (e.g. how to read data)
  - `experiments`: Jupyter notebooks (e.g. combine class instances into a process generating explanations)
  - `transformation`: Classes for data transformation (e.g. create embeddings, reduce dimensions)
  - `access`: Classes for data access (e.g. read or split embeddings)
  - `explanations`: Classes for the explanation process (e.g. handling ML models, generating explanations)
  - `scripts`: Small sets of commands (e.g. to synchronize repositories)
- How to name your code: PEP 8 - Style Guide for Python Code
This work has been supported by the German Federal Ministry of Education and Research (BMBF) within the project EML4U under grant no. 01IS19080B.